# using the warnings module to suppress warnings
import warnings
warnings.filterwarnings('ignore')

# core libraries used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Developed and Tested on Python=3.8.10
!pip install numpy==1.24.2
!pip install pandas==1.5.3
!pip install matplotlib==3.7.0
!pip install seaborn==0.12.2
!pip install scikit_learn==1.2.2
!pip install projectpro --upgrade
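The cells below assume a DataFrame `df` already loaded from the project's dataset (the loading cell is not shown). A minimal sketch of that step, using an in-memory placeholder CSV; the column names and values here are illustrative only, standing in for the real data file:

```python
import io
import pandas as pd

# Placeholder CSV standing in for the real dataset file; the notebook's df is
# loaded the same way via pd.read_csv from the project's data file.
demo_csv = io.StringIO(
    "Customer ID,Month of Joining,Month,arpu\n"
    "uqdtniwvxqzeu1,6,6,120.5\n"
    "aabakestdecft46766,3,3,-10.0\n"
)
df_demo = pd.read_csv(demo_csv)
print(df_demo.shape)  # (2, 4)
```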
# exploring the top 5 rows
df.head(n=5)
# shape of the dataset
df.shape
(653753, 74)
# finding the columns of the dataset
df.columns
Index(['Customer ID', 'Month of Joining', 'Month', 'zip_code', 'Gender', 'Age',
'Married', 'Dependents', 'Number of Dependents', 'Location ID',
'Service ID', 'state', 'county', 'timezone', 'area_codes', 'country',
'latitude', 'longitude', 'arpu', 'roam_ic', 'roam_og', 'loc_og_t2t',
'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others',
'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m',
'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others',
'total_rech_amt', 'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g',
'arpu_4g', 'night_pck_user', 'fb_user', 'aug_vbc_5g', 'Churn Value',
'Referred a Friend', 'Number of Referrals', 'Phone Service',
'Multiple Lines', 'Internet Service', 'Internet Type',
'Streaming Data Consumption', 'Online Security', 'Online Backup',
'Device Protection Plan', 'Premium Tech Support', 'Streaming TV',
'Streaming Movies', 'Streaming Music', 'Unlimited Data',
'Payment Method', 'Status ID', 'Satisfaction Score', 'Churn Category',
'Churn Reason', 'Customer Status', 'offer'],
dtype='object')
### **Customer ID**
#We don't yet know how many customers the data covers, or over how many months. Let's find out.
# To get the number of unique customers in the dataset
len(df["Customer ID"].unique())
98230
# To get the number of unique months in the dataset
len(df["Month"].unique())
14
# To get the basic information of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 653753 entries, 0 to 653752
Data columns (total 74 columns); dtypes: float64(35), int64(8), object(31); memory usage: 369.1+ MB
Fully populated columns (653753 non-null) include: Customer ID, Month of Joining, Month, zip_code, Age, Number of Dependents, Location ID, Service ID, state, country, latitude, longitude, arpu, total_rech_amt, vol_4g, vol_5g, aug_vbc_5g, Churn Value, Phone Service, Multiple Lines, Internet Service, the service add-on columns, Payment Method, Status ID, Satisfaction Score, Customer Status
Columns with missing values include: Gender (643729 non-null), Married (639377), Dependents (629129), county (652835), timezone (653606), area_codes (614096), the call-usage columns (roam_*, loc_og_*, std_og_*, isd_og, spl_og, og_others, loc_ic_*, std_ic_*, spl_ic, isd_ic, ic_others; roughly 540K-635K non-null each), Number of Referrals (653366), total_rech_data (443849), arpu_5g (442898), arpu_4g (441434), night_pck_user (280650), fb_user (243359), Internet Type (436421), Churn Category (29736), Churn Reason (29767), offer (159109)
#**Observation:** There are no null values in the Customer ID, Month of Joining, or Month columns.
#Let's plot how many customers joined in each month.
# To generate a bar plot of the count of customers who joined the telecom service in each month, sorted by month
df['Month of Joining'].astype(int).value_counts().sort_index().plot.bar(figsize=(10,5))
plt.title('Count of Customers Joined by Month')
plt.xlabel('Month')
plt.ylabel('Count')
plt.show()
#**Observation**: A large number of customers joined in June.
#This could be due to a seasonal promotion or a particularly effective marketing campaign,
#or due to external factors such as changes in the market or industry.
#Further analysis and context would be needed to form a more informed hypothesis.
# Distribution of customers across months
customers_per_month = df.groupby("Month")["Customer ID"].nunique()
# Plotting the distribution of customers across months
customers_per_month.plot(kind="bar", figsize=(10, 6))
plt.title("Number of Customers per Month")
plt.xlabel("Month")
plt.ylabel("Number of Customers")
plt.show()
###**Observation**: From the graph, the number of customers increases steadily from January to May,
#jumps sharply in June, and then rises gradually before trending downward.
#This could indicate some issue or change in the company's services or marketing strategy during that period.
#Let's look at customer journeys!
# Filtering for customer ID - uqdtniwvxqzeu1
df[df["Customer ID"]=='uqdtniwvxqzeu1']
#**Observation**: This customer joined the company in the sixth month and stayed until the fourteenth month.
# Filtering for customer ID - aabakestdecft46766
df[df["Customer ID"]=='aabakestdecft46766']
#**Observation**: This customer joined in the third month and left in the same month.
# Compute basic statistics of count of rows per customer
df.groupby(['Customer ID']).size().describe()
#**Observation**:
#The average customer tenure is about 6 months, from which we can generate the following hypotheses:
#* The company needs to improve its customer retention strategies to increase the average customer tenure.
#* The company needs to evaluate its pricing strategy to retain customers for a longer duration.
#* The company needs to focus on improving the quality of its services to increase customer loyalty and prolong the duration of service.
#To understand the actual problem, we need to analyze other variables and see whether their insights support or reject these hypotheses.
# Count the number of customers with a particular number of rows
df.groupby(['Customer ID']).size().value_counts()
#**Observation**: Most customers fall in the 5-to-9-month tenure bracket. We can generate the following hypotheses:
#* Customers who stay with the telecom service provider typically do so for 5-9 months.
#* Customers may be facing issues with the service after 9 months, leading to a higher churn rate beyond that point.
### **Zip Code**
#Some possible hypotheses that could be formed are:
#* The distribution of customers across zip codes follows a normal distribution.
#* There are a few zip codes with significantly more customers than others.
#* The distribution of customers across zip codes is skewed to the right, with the majority of zip codes having fewer customers.
Observation: The distribution supports our hypothesis of a roughly normal distribution.
Since the distribution of customers across zip codes shows little skewness, the customer base appears to be relatively evenly distributed across zip codes. There are no clear pockets of customers in particular zip codes that stand out as far more popular than others. This is a positive sign for the business, as it suggests that the customer base is broad and not overly reliant on a particular geographic area.
# Number of unique zip codes in the dataset
num_unique_zips = df['zip_code'].nunique()
# Number of customers in each zip code
cust_per_zip = df['zip_code'].value_counts()
# Summary statistics of the number of customers per zip code
cust_per_zip.describe()
# Plotting the distribution of customers across zip codes
cust_per_zip.plot(kind='hist', bins=num_unique_zips, figsize=(10,6))
plt.title('Distribution of Customers Across Zip Codes')
plt.xlabel('Number of Customers')
plt.ylabel('Frequency')
plt.show()
### **Gender**
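As a numeric complement to the histogram, pandas' `Series.skew()` quantifies the asymmetry discussed above; a sketch on illustrative per-zip counts (values invented, not the real data):

```python
import pandas as pd

# Illustrative customer counts per zip code (invented values)
cust_per_zip_demo = pd.Series([48, 52, 50, 47, 53, 49, 51, 50])

# Skewness near 0 indicates a roughly symmetric (normal-looking) distribution
skew = cust_per_zip_demo.skew()
print(skew)  # 0.0 for this symmetric toy series
```

In the notebook, the same check would be `cust_per_zip.skew()` on the real counts.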
# Count the number of customers in each gender category
gender_counts = df['Gender'].value_counts()
# Print the gender counts
print(gender_counts)
Female           278668
Male             276917
Not Specified     48938
Other             39206
Name: Gender, dtype: int64
# Counting the number of unique customers whose gender is missing
print('Customers whose gender is missing:' ,df[df['Gender'].isna()]['Customer ID'].nunique())
Customers whose gender is missing: 1487
Observation: 1487 customers have a missing gender. Let's impute it with "Not Specified".
# Filling missing values in the Gender column with 'Not Specified'
df['Gender'] = df['Gender'].fillna('Not Specified')
# Counting the number of customers for each gender, including those with missing values
df['Gender'].value_counts(dropna=False)
### **Age**
#Possible hypotheses that could be formed for the "Age" column are:
#* Most customers fall in the age group of 20 to 40 years.
#* There will be outliers in the age column, as some customers might have entered an invalid age.
# Distribution of customers across age groups
plt.figure(figsize=(10,6))
sns.histplot(df['Age'], kde=True)  # histplot supersedes the deprecated distplot
plt.title('Distribution of Customers across Age Groups')
plt.xlabel('Age')
plt.ylabel('Number of Customers')
plt.show()
# Descriptive statistics of age column
df['Age'].describe()
Observation: The maximum age is 9999, which supports our hypothesis that the age column contains outliers where customers entered an invalid age.
# let's see the sorted age values
df['Age'].sort_values()
# So we need to replace 9999 with the mean, the median, or e.g. the 95th percentile!
df['Age'].quantile([0.8,0.9,0.95,0.99])
# Let's look further into the extreme quantiles
df['Age'].quantile([0.99,0.999,0.99991,0.999999])
# Hence we will treat the invalid ages as missing and later impute them with plausible values
df['Age'].value_counts()
# We will fill the NaNs later using an imputation method
# For now, mark the invalid value 9999 as NaN
df['Age'] = df['Age'].replace(9999, np.nan)
# Extreme quantiles after the replacement
df['Age'].quantile([0.99,0.999,0.99991,0.999999])
#We will fill the null values in Age later.
Observation: The data now looks valid, and it supports our hypothesis that most customer ages fall between 20 and 40.
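The Age NaNs introduced above are left for later; one common approach, sketched here on toy values (not the real column), is median imputation:

```python
import numpy as np
import pandas as pd

# Toy ages with the invalid entries already marked as NaN (illustrative values)
ages = pd.Series([25.0, 30.0, np.nan, 40.0, 35.0, np.nan])

# Fill missing ages with the median of the observed values
ages_imputed = ages.fillna(ages.median())
print(ages_imputed.tolist())  # [25.0, 30.0, 32.5, 40.0, 35.0, 32.5]
```

scikit-learn's `SimpleImputer(strategy='median')` does the same and fits more naturally into a modeling pipeline.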
### **Married**
# Count of each unique value in the 'Married' column including NaN values
df['Married'].value_counts(dropna=False)
# Count of unique customer IDs where 'Married' column is NaN
df[df['Married'].isna()]['Customer ID'].nunique()
2179
# Replacing NaN values in the 'Married' column with 'Not Specified'
df['Married'] = df['Married'].fillna('Not Specified')
# Count of each unique value in the 'Married' column after filling NaN values with 'Not Specified'
df['Married'].value_counts(dropna=False)
### **Dependents**
# Count the number of unique values in "Dependents" column
df['Dependents'].value_counts(dropna=False)
Observation: The Dependents and Number of Dependents columns are linked.
For the time being, though, we will simply replace nulls with "Not Specified" and deal with both columns together in the following section.
# Fill missing values in dependents with "Not Specified"
df['Dependents']=df['Dependents'].fillna('Not Specified')
# Count the number of unique values in "Dependents" column after imputation
df['Dependents'].value_counts(dropna=False)
# Percentage of customers with dependents
percent_with_dependents = (df["Dependents"].value_counts(normalize=True) * 100).loc["Yes"]
print(f"Percentage of customers with dependents: {percent_with_dependents:.2f}%")
Percentage of customers with dependents: 25.94%
Observation: The percentage of customers with dependents gives us insights into the family structure of the customer base. This can help us understand the type of services that are in demand among families and whether there are opportunities to introduce family-oriented plans and promotions.
### **Number of Dependents**
# Counting the number of unique values in Number of Dependents column
df['Number of Dependents'].value_counts(dropna=False)
Observation: There are no null values in the Number of Dependents column.
Nonetheless, a value of 699 dependents stands out as clearly invalid. Let's replace it with NaN and impute it later.
# Replace 699 with NaN
df.loc[df['Number of Dependents']==699,'Number of Dependents']=np.nan
# Count the frequency of unique values in the 'Dependents' and 'Number of Dependents' columns
df[['Dependents','Number of Dependents']].value_counts(dropna=False)
Observation:
Some rows are ambiguous: the 'Dependents' column indicates that the customer has dependents, yet the 'Number of Dependents' column is 0 for those rows.
It is recommended to check with the business team for clarification. However, we will not address this in the current notebook.
### **Location ID and Service ID**
# Count the number of unique Location IDs
df['Location ID'].nunique()
98230
# Count the number of unique Service IDs
df['Service ID'].nunique()
98230
Observation: Each customer ID appears to have a distinct Location ID and Service ID, and the two go hand in hand.
# Using value_counts() method to get the count of unique values for Location ID and Service ID columns
df[['Location ID','Service ID']].value_counts(dropna=False)
Observation: The counts of unique values for the Location ID and Service ID columns are almost the same, which indicates that Location ID and Service ID are related to each other, supporting our assumption.
### **State**
# Count the unique values of State column
df['state'].value_counts(dropna=False)
# Bar plot of customer count for each state
plt.figure(figsize=(10,6))
df['state'].value_counts(dropna=False).plot.bar()
plt.title('Bar plot of customer count for each state')
plt.xlabel('State')
plt.ylabel('Number of Customers')
plt.show()
# Grouping by state and calculating unique customer count
df.groupby(['state'],as_index=False).agg(Number_of_customers=('Customer ID','nunique'))
Observation: The maximum number of customers are from California, with 57665 customer IDs.
We saw earlier that the roughly normal distribution over zip codes suggests that customers are evenly distributed across zip codes.
Hypothesis: California may simply contain the most zip codes.
# Count the number of unique zip codes per state
df.groupby('state')['zip_code'].nunique()
Yes! California has the most zip codes.
Hypotheses that can be deduced:
* The telecom service provider has a strong presence in California, leading to a higher customer base.
* Customers from California have higher mobile phone usage compared to other states, leading to more subscriptions with the telecom service provider.
These conclusions also suggest the following actions:
* Location-based marketing strategy: the insight that the majority of customers are from California could be used to develop a location-based marketing strategy that targets customers in this state with customized offers, campaigns, and promotions.
* Improve customer distribution: analyze and bring in more customers from other states to even out the customer distribution and ensure the company is not overly reliant on a particular area.
### **County**
# Group by county and calculate number of unique customers in each county
df.groupby(['county'],as_index=False).agg(Number_of_customers=('Customer ID','nunique')).sort_values(by=['Number_of_customers'])
# Count the number of missing values in county column
df['county'].isna().sum()
918
Observations:
* Los Angeles County has the highest number of customers, while Garfield County has the lowest.
* There are 918 missing values in the county column.
Hypothesis:
Analyzing customer preferences and behaviors in different locations can help the company target its marketing and promotional strategies effectively.
# Finding the values of state, latitude and longitude where county is missing
df[df['county'].isna()][['state','latitude','longitude']].value_counts()
# Finding the county for specific latitude and longitude values
df[(df.latitude==32.99)&(df.longitude==-106.97)]['county'].value_counts(dropna=False)
# Finding the county for specific latitude and longitude values
df[(df.latitude==34.75)&(df.longitude==-92.27)]['county'].value_counts(dropna=False)
# Finding the county for specific latitude and longitude values
df[(df.latitude==35.82)&(df.longitude==-106.20)]['county'].value_counts(dropna=False)
# Finding the county for specific latitude and longitude values
df[(df.latitude==47.85)&(df.longitude==-122.22)]['county'].value_counts(dropna=False)
# Finding the state for specific latitude and longitude values
df[(df.latitude== 37.95)&(df.longitude==-121.29 )]['state'].value_counts(dropna=False)
# Replacing missing values in county with 'Unknown'
df['county']=df['county'].fillna('Unknown')
Observation: We see that there are some values of latitude and longitude for which the county value is missing (NaN). To fill in these missing values, we can find the county corresponding to those latitude and longitude values. However, for this exercise, we will simply replace the missing county values with 'Unknown'.
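The idea mentioned above, recovering a missing county from its coordinates, can be sketched like this on toy data (coordinates and county names are illustrative, not the real values):

```python
import numpy as np
import pandas as pd

# Toy frame: the second row's county is missing but shares coordinates with the first
df_toy = pd.DataFrame({
    "latitude":  [32.99, 32.99, 34.75],
    "longitude": [-106.97, -106.97, -92.27],
    "county":    ["County A", np.nan, np.nan],
})

# Build a (latitude, longitude) -> county lookup from rows where county is known
key = ["latitude", "longitude"]
known = df_toy.dropna(subset=["county"]).drop_duplicates(key)[key + ["county"]]

# Left-merge the lookup back and fill the gaps it can resolve
filled = df_toy.merge(known, on=key, how="left", suffixes=("", "_known"))
df_toy["county"] = df_toy["county"].fillna(filled["county_known"])
print(df_toy["county"].tolist())  # ['County A', 'County A', nan]
```

Coordinates with no known-county counterpart (the third row) stay NaN and would still need the 'Unknown' fallback used in the notebook.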
### **Timezone**
# Count the number of customers in each timezone
df['timezone'].value_counts(dropna=False)
# Identify the customers with missing timezone and their state and county information
df[df['timezone'].isna()][['state','county']].value_counts(dropna=False)
# Check the unique timezone value for customers in Craighead County
df[df.county=='Craighead County'].timezone.unique()
array([nan, 'America/Chicago'], dtype=object)
# Fill the missing timezone values with "America/Chicago"
df['timezone']=df['timezone'].fillna('America/Chicago')
Observation: There are 147 null values for timezone, all of which belong to Craighead County. The corresponding timezone for this county is "America/Chicago", so we safely replaced the missing values with this timezone.
Hypothesis:
None for now; we'll revisit timezone once we have analyzed other variables.
### **Area Codes**
# Count the values of area codes
df['area_codes'].value_counts(dropna=False)
#There are some missing values. Let us explore deeper.
# Count the values of state and county for missing values of area codes
df[df['area_codes'].isna()][['state','county']].value_counts(dropna=False)
Let's see if each county has only one area code.
# Count the values of County and area codes
df[['county','area_codes']].value_counts(dropna=False)
Observation: As we can see, San Diego County has at least two area codes, so a county does not map to a single area code and we cannot recover the missing area codes from the county alone.
We can verify this with the company.
For the time being, let us simply replace the NaNs with "Unknown".
# Filling the missing value with Unknown
df['area_codes']=df['area_codes'].fillna('Unknown')
### **Latitude and Longitude**
# Check the number of nulls in latitude
df['latitude'].isna().sum()
0
# Check the number of nulls in longitude
df['longitude'].isna().sum()
0
# Count the unique values of latitude and longitude
df[['latitude','longitude']].value_counts() Start coding or generate with AI.
Are these mostly from California, since that state has the most customers?
# Checking if the latitude and longitude values are from CA
df[df['latitude']==38.58][['latitude', 'longitude', 'state']].value_counts()
Observation: Yes, these latitude and longitude values are from California.
### **Average Revenue per User (ARPU)**
Some hypotheses for univariate analysis are:
* We don't yet know which customers have been profitable or unprofitable for the organization.
* The distribution of arpu values might be skewed, in which case the median arpu might be a better measure of central tendency than the mean.
* There might be some outliers in the arpu values.
# Check the null values in ARPU
df['arpu'].isna().sum()
0
There are no null values in arpu.
Observation:
* The distribution of ARPU values is right-skewed, with most values concentrated below 500 and a few customers with very high ARPU, supporting our hypotheses.
* There are some negative ARPU values, which may be due to refunds. A negative ARPU means that the company is losing money on a per-user basis rather than making a profit. It is not uncommon for telecom companies to have negative ARPU due to the high costs associated with acquiring and servicing customers.
# ARPU is negative here.
# Let's investigate
df.arpu.describe()
# Now let us investigate how many people have negative arpu
df[df['arpu']<0].shape
(115942, 74)
# So nearly 115K customers have negative ARPU.
# Let's calculate the percentage
115942/df.shape[0]
0.17734832574382067
Observation:
The fact that about 17.7% of the rows have negative arpu means that a significant proportion of the customer base is not generating revenue for the telecom company.
This could be due to a number of reasons, such as inactive accounts, delinquent accounts, or accounts with outstanding balances.
To understand more about this issue, we can investigate other variables that may be related to negative arpu, such as churn status. We can also look at the distribution of negative arpu values across different demographic groups, such as age, gender, and location, to see if there are any patterns or trends.
Additionally, we can explore whether there are any correlations between negative arpu and other variables, such as usage patterns, service subscriptions, and customer satisfaction scores to understand what's causing this behaviour.
Overall, further investigation is needed to fully understand the implications of negative arpu for the telecom company.
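The churn link suggested above can be sketched with a quick group-by. This is a minimal example on synthetic data; the column names 'arpu' and 'Churn Value' are taken from the notebook, the values are made up:

```python
import pandas as pd

# Synthetic stand-in for the project's dataframe
toy = pd.DataFrame({
    'arpu':        [120.5, -30.0, 450.0, -5.2, 80.0, -110.0],
    'Churn Value': [0, 1, 0, 0, 1, 1],
})

# Churn rate among negative-ARPU rows vs the rest
neg = toy['arpu'] < 0
churn_by_sign = toy.groupby(neg)['Churn Value'].mean()
```

A large gap between the two group means would support the idea that negative ARPU is related to churn.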
#Average revenue per user over 4G and 5G network (ARPU 4G & ARPU 5G) Start coding or generate with AI.
# Check nulls
df['arpu_4g'].isna().sum() Start coding or generate with AI.
212319
# Check nulls
df['arpu_5g'].isna().sum() Start coding or generate with AI.
210855
There are about 210K null values in arpu_4g and arpu_5g.
# We have not been told whether arpu_4g and arpu_5g sum to the overall arpu
# Let us analyze all 3 variables together
(df['arpu_5g']+df['arpu_4g']==df['arpu']).value_counts(dropna=False) Start coding or generate with AI.
Observation: arpu_4g and arpu_5g do not add up to the overall arpu.
Hypothesis:
Overall ARPU = arpu_4g + arpu_5g + other_arpu_components
where other_arpu_components represent the revenue generated from other sources such as voice calls, messaging, and data usage on 2G and 3G networks.
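The hypothesis can be checked by computing the residual. A minimal sketch on synthetic rows (column names taken from the notebook, values made up); if the hypothesis holds, the residual should be non-negative revenue from voice, messaging, and 2G/3G data:

```python
import numpy as np
import pandas as pd

# Toy rows with the notebook's assumed column names
toy = pd.DataFrame({
    'arpu':    [300.0, 150.0, 90.0],
    'arpu_4g': [100.0,  50.0, np.nan],
    'arpu_5g': [120.0,  40.0, np.nan],
})

# min_count=1 keeps the residual NaN when both components are missing
toy['other_arpu'] = toy['arpu'] - toy[['arpu_4g', 'arpu_5g']].sum(axis=1, min_count=1)
```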
# Let's see how we can deal with null values in these columns Start coding or generate with AI.
# Frequency count of different values in the 'total_rech_data' column for users who did not have any data recharge
# and also have missing values in both 'arpu_4g' and 'arpu_5g' columns
df[(df['arpu_4g'].isna())&(df['arpu_5g'].isna())]['total_rech_data'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
Since these rows have no data recharge, it is safe to assume that the null values in arpu_4g and arpu_5g also represent customers who have not recharged for internet service. We will impute these values once we analyze the total_rech_data variable.
#Night Pack User Start coding or generate with AI.
# Check the value counts
df['night_pck_user'].value_counts(dropna=False) Start coding or generate with AI.
Observation: Since the night_pck_user column has more than 50% missing values, it's not wise to impute those missing values as it may lead to biased results. So, it's better to keep the NaNs as they are and analyze the available data.
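The same missing-value threshold can be applied systematically across columns. A small sketch on synthetic data, assuming the 50% cutoff from the observation above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'night_pck_user': [1.0, np.nan, np.nan, np.nan],
    'arpu':           [100.0, 50.0, 75.0, 20.0],
})

missing_pct = toy.isna().mean()          # fraction of NaNs per column
drop_candidates = missing_pct[missing_pct > 0.5].index.tolist()
```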
# **Facebook User - Social Networking Scheme** Start coding or generate with AI.
# Check the counts of unique values in fb user column
df['fb_user'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
Since the fb_user column has more than 50% null values, we cannot draw any conclusions regarding this variable.
It would not be meaningful to perform any analysis or generate any visualizations based on this variable. If we were to include fb_user in our analysis, we would have to drop a significant portion of the data, which could result in biased or inaccurate results.
Therefore, it may be best to exclude this variable from our analysis altogether.
### **Volume Based cost for 5G network (outside the scheme paid based on extra usage)** Start coding or generate with AI.
# Plot distribution of aug_vbc_5g
plt.figure(figsize=(10,6))
sns.histplot(df['aug_vbc_5g'], kde=True, stat='density')  # distplot is deprecated in recent seaborn
# Add title and labels to the plot
plt.title('Distribution of Volume Based Cost for 5G Network')
plt.xlabel('Volume Based Cost')
plt.ylabel('Density') Start coding or generate with AI.
# statistics for aug_vbc_5g
df['aug_vbc_5g'].describe() Start coding or generate with AI.
Hypotheses:
The distribution of aug_vbc_5g may have outliers.
The max value of 254687 may not be a correct value.
# Max value of aug_vbc_5g
df['aug_vbc_5g'].quantile([0.8,0.9,0.95,0.99]) Start coding or generate with AI.
# So we are having problems in 99th percentile
df['aug_vbc_5g'].quantile([0.99,0.999,0.9999]) Start coding or generate with AI.
# Replace incorrect max value with NaN
df['aug_vbc_5g'].replace(254687.0, np.NaN, inplace=True) Start coding or generate with AI.
# Investigate max value of aug_vbc_5g
df['aug_vbc_5g'].quantile([0.99,0.999,0.9999]) Start coding or generate with AI.
# Replace incorrect max value with NaN
df['aug_vbc_5g'].replace(87978, np.NaN, inplace=True) Start coding or generate with AI.
# Plot distribution of aug_vbc_5g again
plt.figure(figsize=(10,6))
sns.histplot(df['aug_vbc_5g'], kde=True, stat='density')  # distplot is deprecated in recent seaborn
# Add title and labels to the plot
plt.title('Distribution of Volume Based Cost for 5G Network')
plt.xlabel('Volume Based Cost')
plt.ylabel('Density') Start coding or generate with AI.
# Print updated quantiles for aug_vbc_5g
df['aug_vbc_5g'].quantile([0.8, 0.9, 0.99]) Start coding or generate with AI.
# Filter values > 5000 for aug_vbc_5g
df[df['aug_vbc_5g']>5000] Start coding or generate with AI.
Observations:
The maximum value of 254687 was replaced with NaN due to it being an incorrect value.
One additional incorrect value of 87978 was also replaced with NaN.
The 99th percentile value was changed to 7387.188 after replacing the incorrect values.
Values greater than 5000 were found to be valid values that may require transformations or normalization later.
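The manual replacements above can be generalized into a quantile-based mask. A hedged sketch on a toy series; the 0.99 cutoff is an assumption for illustration, not the notebook's exact rule:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 15.0, 12.0, 254687.0])   # one absurd spike

# Mask readings above a high quantile as missing instead of
# replacing specific values by hand
cutoff = s.quantile(0.99)
cleaned = s.mask(s > cutoff, np.nan)
```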
#Churn Value Start coding or generate with AI.
# Count the values of churned customers
df['Churn Value'].value_counts(dropna=False) Start coding or generate with AI.
There are no null values in the column "Churn Value".
# Normalizing value counts
df['Churn Value'].value_counts(normalize=True) Start coding or generate with AI.
Observation: Only a small fraction of rows have a churn value of 1, so let's examine churn at the level of distinct customers.
# Unique customer IDs and their churn values
cust_churn = df[['Customer ID', 'Churn Value']].drop_duplicates()
# Calculate the percentage of customers who churned out
churn_percent = round((cust_churn['Churn Value'].value_counts()[1] / len(cust_churn)) * 100, 2)
print("Percentage of customers who churned out:", churn_percent, "%") Start coding or generate with AI.
Percentage of customers who churned out: 24.72 %
# Convert Month column to integer and filter Churn Value =1
churn_monthly = df[df['Churn Value'] == 1]['Month'].astype(int).value_counts().sort_index()
# Plotting bar chart
plt.figure(figsize=(10,6))
plt.bar(churn_monthly.index, churn_monthly.values)
plt.title('Monthly Churned Customers')
plt.xlabel('Month')
plt.ylabel('Number of Customers')
plt.show() Start coding or generate with AI.
Observation:
From the plot, we can see that there is a gradual increase in the number of churned customers from month 7 to month 11.
Month 11 has the highest number of churned customers, indicating that there may have been some issue with the services provided during that month that led to customers leaving the network. We will have to analyze this further.
#Offer Start coding or generate with AI.
# Count the unique values in Offer column
df['offer'].value_counts(dropna=False) Start coding or generate with AI.
Observation: There are a lot of null values in the offer column.
Hypotheses:
It's possible that the null values in the offer column are due to customers not being eligible for any offers.
However, it's also possible that there was an error in recording the data or that some customers were missed when the offers were being given out.
We can investigate further by looking at the distribution of offer types and see if there are any patterns based on customer demographics or behavior.
# Filling Nulls with No Offer
df['offer']=df['offer'].fillna('No Offer') Start coding or generate with AI.
# Check the value counts again
df[df['Churn Value']==1]['offer'].value_counts() Start coding or generate with AI.
Observation:
If the offers are distributed in the same way, it suggests that the company has not targeted any specific group of customers with special offers.
This could mean that the company is more focused on providing equal benefits to all customers rather than targeting specific segments.
We can explore the relationship between the offers and customer data, such as age, gender, location, and usage patterns. This could help identify any patterns or preferences among customers that could help with the future marketing strategies.
Analyzing the impact of the offers on customer churn could help the company optimize their promotional campaigns.
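One way to start the churn-impact analysis suggested above is a churn rate per offer group. A minimal sketch on synthetic data, using the notebook's 'offer' and 'Churn Value' column names:

```python
import pandas as pd

toy = pd.DataFrame({
    'offer':       ['No Offer', 'Offer A', 'No Offer', 'Offer A', 'Offer B'],
    'Churn Value': [1, 0, 0, 1, 1],
})

# Churn rate within each offer group; large gaps would suggest some
# offers retain customers better than others
churn_by_offer = toy.groupby('offer')['Churn Value'].mean()
```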
### **Referred a Friend and Number of Referrals** Start coding or generate with AI.
# Count the values if a customer referred the company's services to a friend
df['Referred a Friend'].value_counts(dropna=False) Start coding or generate with AI.
# Here are the number of referrals
df['Number of Referrals'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
There are some null values; we can impute them later.
It will be interesting to see both the column of friends referred and the number of referrals together.
# Count the values of Referred a Friend and Number of Referrals
df[['Referred a Friend','Number of Referrals']].value_counts(dropna=False) Start coding or generate with AI.
Observations:
The majority of customers have answered "no" to the question "Referred a Friend" and have also not made any referrals.
A small percentage of customers who have answered "no" to the question "Referred a Friend" have made referrals.
Among the customers who have answered "yes" to the question "Referred a Friend", there is a wide range in the number of referrals made, with some customers having made only one or two referrals while others have made 10-11 as well.
To further investigate the small percentage of customers who have made referrals despite answering "no" to the question "Referred a Friend", it may be useful to analyze the source of the referrals.
To incentivize more referrals and increase customer acquisition, the company may want to consider offering referral bonuses or other rewards to customers who successfully refer new customers.
The company could analyze the characteristics and behavior of customers who are more likely to make referrals and improve their marketing campaigns.
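The inconsistency flagged above ("No" answers that still record referrals) can be isolated with a boolean filter. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Referred a Friend':   ['No', 'Yes', 'No', 'Yes'],
    'Number of Referrals': [0, 3, 2, 1],
})

# Rows that say "No" but still record referrals
inconsistent = toy[(toy['Referred a Friend'] == 'No') & (toy['Number of Referrals'] > 0)]
```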
### **Phone Service** Start coding or generate with AI.
# Count the value of customers with and without phone service
plt.figure(figsize=(10,6))
df['Phone Service'].value_counts(dropna=False).plot.bar()
plt.title("Distribution of Customers with Phone Service")
plt.ylabel("Count")
plt.show() Start coding or generate with AI.
Observation:
Phone Service is a popular choice among customers, with over 5 lakh records (not all unique customers) opting for it over the 14-month period. This suggests that it is an important service for the company to continue offering and investing in.
It would be interesting to explore the demographics and usage patterns of customers who opted for Phone Service versus those who did not. This could help the company better understand the needs and preferences of its customer base.
### **Internet Service and Internet Type** Start coding or generate with AI.
# Count the values of the customers with and without Internet Service
df['Internet Service'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
About 4 lakh records (not unique customers) have opted for internet service over the 14 months.
To gain more insights, we can combine this observation with other columns like churn value, total charges, contract type, payment method, and monthly charges, as mentioned earlier. This can help us understand the behavior and preferences of customers who have opted for internet service and identify potential areas for improvement or upselling.
# Count the Internet Type values
df['Internet Type'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
There are four types of internet services offered: Fiber Optic, Cable, DSL, and None.
Majority of customers have opted for Fiber Optic and Cable internet services.
There may be cases where the internet service is yes but the internet type is not specified (NaN values).
There may be cases where the internet service is no but the internet type is specified (which does not make sense).
# Count the values of internet service and type together
df[['Internet Service','Internet Type']].value_counts(dropna=False) Start coding or generate with AI.
Observation:
As these are not unique values but monthly values for each customer, we'll have to see which customers had internet service in a given month.
Some customers have said yes to having internet service but their internet type is none.
It could mean that the customer has not used the internet during that month. However, it could also mean that there is missing data or the customer is using an unknown type of internet service.
We can check if there is any usage data available for the customer during that month. If there is usage data available, we can assume that the customer has used the internet and fill the missing value with the most common internet type for that customer. If there is no usage data available, we can leave the value as None.
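The fill strategy described above (the most common internet type per customer) can be sketched as follows. Synthetic rows; column names assumed from the notebook, and customers with no known type at all stay NaN:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Customer ID':   ['c1', 'c1', 'c1', 'c2'],
    'Internet Type': ['Fiber Optic', np.nan, 'Fiber Optic', np.nan],
})

def fill_mode(s):
    # Fill NaNs in a group with the group's most frequent value,
    # leaving the group untouched if it has no known values
    m = s.mode()
    return s.fillna(m.iloc[0]) if not m.empty else s

toy['Internet Type'] = toy.groupby('Customer ID')['Internet Type'].transform(fill_mode)
```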
#Unlimited data Start coding or generate with AI.
# Count the values of the Unlimited Data column
df['Unlimited Data'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
#Streaming Movies Start coding or generate with AI.
# Count the values of the streaming movies column
df['Streaming Movies'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
# Streaming Tv Start coding or generate with AI.
# Count the values of the streaming TV column
df['Streaming TV'].value_counts(dropna=False) Start coding or generate with AI.
#Streaming Music Start coding or generate with AI.
# Count the values of the streaming movies column
df['Streaming Music'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
# Streaming data consumption Start coding or generate with AI.
df['Streaming Data Consumption'].describe() Start coding or generate with AI.
# Plot
plt.figure(figsize=(10,6))
sns.histplot(df['Streaming Data Consumption'], kde=True)  # distplot is deprecated in recent seaborn
plt.xlabel('Streaming Data Consumption')
plt.ylabel('Frequency')
plt.show() Start coding or generate with AI.
# check null values
df['Streaming Data Consumption'].isna().sum() Start coding or generate with AI.
0
Observation:
The minimum value is 0 GB, indicating that some customers did not consume any streaming data during the month, while the maximum value is 85 GB.
The highly right skewed distribution of streaming consumption data indicates that a small proportion of customers are consuming a disproportionately large amount of streaming data.
From a business perspective, this information could be used to tailor marketing and pricing strategies to target heavy streaming users and to offer them plans that meet their needs.
We can further explore how streaming data consumption relates to other columns such as "Streaming Movies" or "Streaming TV" to gain additional insights into customer behavior and preferences with respect to churn value and satisfaction score.
There are no null values in the column.
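The suggested comparison of streaming consumption against the streaming flags can be started with a group-by. A minimal sketch on synthetic data, with column names taken from the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    'Streaming Movies':           ['Yes', 'No', 'Yes', 'No'],
    'Streaming Data Consumption': [40.0, 2.0, 55.0, 0.0],
})

# Mean consumption split by the Streaming Movies flag
usage_by_flag = toy.groupby('Streaming Movies')['Streaming Data Consumption'].mean()
```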
#Online security Start coding or generate with AI.
# Check the counts of Online Security column
df['Online Security'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
#Premium Tech Support Start coding or generate with AI.
# Count the unique values of Premium Tech Support
df['Premium Tech Support'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
#Payment method Start coding or generate with AI.
# Count the unique values of payment method
df['Payment Method'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
Bank withdrawal is the most popular payment method among customers.
There is a significant difference in the number of customers using bank withdrawal compared to credit card or wallet balance.
#Status ID Start coding or generate with AI.
# Total null values in Status ID
df['Status ID'].isna().sum() Start coding or generate with AI.
0
# Number of Unique Status IDs
df['Status ID'].nunique() Start coding or generate with AI.
653753
Each payment made by a customer has a unique Status ID.
# Count the values of each satisfaction score
df['Satisfaction Score'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
Most of the customers have rated their satisfaction with a score of 3.
A significant number of customers have rated their satisfaction with scores of 1 and 2.
The satisfaction score is an important metric to measure customer satisfaction and loyalty.
It is important to investigate why a significant number of customers have rated their satisfaction with lower scores, i.e., 1 and 2.
Based on the insights obtained from other columns, such as the service type, internet type, and device protection plan, we can analyze whether there is a correlation between these variables and customer satisfaction.
#Churn Category Start coding or generate with AI.
# Count the values of different churn categories
df['Churn Category'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
The "Churn Category" column has a high percentage of missing values, with NaN accounting for the majority of the values.
Let's investigate the null values.
# Let's see the churn value for all those observations where churn category is null
df[df['Churn Category'].isna()]['Churn Value'].value_counts(dropna=False) Start coding or generate with AI.
Hence, if a customer has not churned, we can simply mark the category as Not Applicable and mark the others as Unknown.
# Filling null values of churn category as not applicable where customers did not churn
df.loc[(df['Churn Category'].isna())&(df['Churn Value']==0),'Churn Category']='Not Applicable' Start coding or generate with AI.
# Filling the churn category as Unknown where churn value was 1
df.loc[(df['Churn Category'].isna())&(df['Churn Value']==1),'Churn Category']='Unknown' Start coding or generate with AI.
# Check the counts again
df['Churn Category'].value_counts(dropna=False) Start coding or generate with AI.
#Churn Reason Start coding or generate with AI.
# Count the values of different churn reasons
df['Churn Reason'].value_counts(dropna=False) Start coding or generate with AI.
Observation:
The "Churn Reason" column has a high percentage of missing values.
Let's investigate the null values.
There are some null and absurd values in the churn reason column.
# Count of null churn reasons and their churn value
df[(df['Churn Reason'].isna())]['Churn Value'].value_counts(dropna=False) Start coding or generate with AI.
Likewise, whenever a customer has not churned, we can simply mark the reason as Not Applicable and mark the others as Unknown.
# Filling the churn reason as not applicable where churn value was 0
df.loc[(df['Churn Reason'].isna())&(df['Churn Value']==0),'Churn Reason']='Not Applicable' Start coding or generate with AI.
# Filling the churn reason as Unknown where churn value was 1
df.loc[(df['Churn Reason'].isna())&(df['Churn Value']==1),'Churn Reason']='Unknown' Start coding or generate with AI.
# Check the value counts in the Churn Reason again
df['Churn Reason'].value_counts(dropna=False) Start coding or generate with AI.
#Customer status Start coding or generate with AI.
# Check the unique values
df['Customer Status'].unique() Start coding or generate with AI.
array(['Churned', 'Stayed'], dtype=object)
# Check customer status and churn values together
df[['Customer Status','Churn Value']].value_counts(dropna=False) Start coding or generate with AI.
Everything seems good here!
#Customer call usage analysis Start coding or generate with AI.
# All columns
df.columns Start coding or generate with AI.
Index(['Customer ID', 'Month of Joining', 'Month', 'zip_code', 'Gender', 'Age',
'Married', 'Dependents', 'Number of Dependents', 'Location ID',
'Service ID', 'state', 'county', 'timezone', 'area_codes', 'country',
'latitude', 'longitude', 'arpu', 'roam_ic', 'roam_og', 'loc_og_t2t',
'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others',
'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m',
'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others',
'total_rech_amt', 'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g',
'arpu_4g', 'night_pck_user', 'fb_user', 'aug_vbc_5g', 'Churn Value',
'Referred a Friend', 'Number of Referrals', 'Phone Service',
'Multiple Lines', 'Internet Service', 'Internet Type',
'Streaming Data Consumption', 'Online Security', 'Online Backup',
'Device Protection Plan', 'Premium Tech Support', 'Streaming TV',
'Streaming Movies', 'Streaming Music', 'Unlimited Data',
'Payment Method', 'Status ID', 'Satisfaction Score', 'Churn Category',
'Churn Reason', 'Customer Status', 'offer'],
dtype='object')
# Setting Customer ID and Month as index
df=df.set_index(['Customer ID','Month']) Start coding or generate with AI.
# Selecting specific columns to check the customer call usage
df_call_usage=df[['Month of Joining','roam_ic', 'roam_og', 'loc_og_t2t',
'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others',
'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m',
'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others',
'total_rech_amt','Phone Service']] Start coding or generate with AI.
# Top three rows
df_call_usage.head(3) Start coding or generate with AI.
# There are a lot of null values
df_call_usage.describe() Start coding or generate with AI.
Observations:
Several columns share a suspicious maximum value of 8.787978e+06, which looks like a data-entry artifact.
Some columns seem to have null values.
The columns total_rech_amt and Phone Service are missing from this summary because they are non-numeric.
# Let's check how many null values are present where calling plan is there
df_call_usage[df_call_usage['Phone Service']=='Yes'].isna().sum() Start coding or generate with AI.
# Lets check how many % values are missing
df_call_usage[df_call_usage['Phone Service']=='Yes'].isna().sum()/df_call_usage.shape[0] Start coding or generate with AI.
# Let's check how many null values are present where there is no calling plan
df_call_usage[df_call_usage['Phone Service']=='No'].isna().sum() Start coding or generate with AI.
# Lets check how many % values are missing
df_call_usage[df_call_usage['Phone Service']=='No'].isna().sum()/df_call_usage.shape[0] Start coding or generate with AI.
We will impute all these values later!
# lets find datatypes of all the columns
df_call_usage.dtypes Start coding or generate with AI.
Observe that total_rech_amt has dtype object, but according to the data dictionary it should be a float.
# There are 124570 rows containing no calling plan
df_call_usage['Phone Service'].value_counts() Start coding or generate with AI.
# Counts of unique values in total rech amount
df_call_usage['total_rech_amt'].value_counts() Start coding or generate with AI.
# Replace the invalid value
df_call_usage.loc[df_call_usage['total_rech_amt']=='e01','total_rech_amt']=10 Start coding or generate with AI.
Let's now change the column to a numeric datatype.
# Changing the column back to numeric data type
df_call_usage['total_rech_amt']=pd.to_numeric(df_call_usage['total_rech_amt']) Start coding or generate with AI.
# Plot the the distribution using Box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage['total_rech_amt'])
plt.title("The Distribution of Total Recharge Amount")
plt.xlabel("Total Recharge Amount")
plt.show() Start coding or generate with AI.
It is worth noting that certain customers have recharge amounts around 100,000. Let's see how many such customers there are.
# Check the quantiles
df_call_usage['total_rech_amt'].quantile([0.1, 0.5,0.75 ,0.9,0.97,0.99,0.999]) Start coding or generate with AI.
Observe that the 0.99 quantile value is 11875. Let's check how many customers have a recharge amount above that.
# Filter the dataframe
df_call_usage[df_call_usage['total_rech_amt']>11875.0] Start coding or generate with AI.
Notice that the majority of these values occur when Phone Service is not selected; let us see whether all of the values follow the same pattern.
# Check the value counts where phone service is no
df_call_usage[df_call_usage['Phone Service']=='No']['total_rech_amt'].value_counts() Start coding or generate with AI.
Observations:
The values 98769 and 0.1 in the distribution seem to be outliers or random values that need further investigation.
It is recommended to replace the value 98769 with 0 as it seems to be an incorrect data entry.
The value 0.1 could also be an incorrect data entry or an outlier that needs further investigation.
It is possible that the values 989 and 10 in the distribution could be actual recharge amounts.
It is also possible that for these values, the customers have opted for Phone Service but there was a mistake in data entry. Further investigation is needed to confirm this.
# Filter the customers who have phone service and outlier values as well
df_call_usage[(df_call_usage['Phone Service']=='Yes')&(df_call_usage['total_rech_amt'].isin([0.1,98769]))] Start coding or generate with AI.
There are no customers who recharged with 0.1 or 98769 and chose phone service. We shall now replace those with 0's.
# Replace values with 0
df_call_usage['total_rech_amt']=df_call_usage['total_rech_amt'].replace([0.1,98769],0) Start coding or generate with AI.
# Check the counts again for phone service = no
df_call_usage[df_call_usage['Phone Service']=='No']['total_rech_amt'].value_counts() Start coding or generate with AI.
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage[(df_call_usage['Phone Service']=='No')&(df_call_usage['total_rech_amt'].isin([989,10]))].select_dtypes('number'))
plt.xticks(rotation=90)
plt.show() Start coding or generate with AI.
Take note that almost all of the columns have values close to either 0 or 8.787978e+06.
Let's look at the quantiles for all of the outgoing call utilisation columns.
# Specifying the calling columns
outgoing_call_col=['roam_og', 'loc_og_t2t','loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others'] Start coding or generate with AI.
# Call Usage for customers with no phone service and high recharge amount
df_call_usage[(df_call_usage['Phone Service']=='No')&(df_call_usage['total_rech_amt'].isin([989,10]))][outgoing_call_col].apply(lambda col: col.unique()) Start coding or generate with AI.
Observations:
It is noticeable that the values are restricted to a few options including 0, 45, 46254, 8787978, and 68.
These values seem to have some relation with the recharge amount. However, it is evident that these values are unlikely to be actual data and are likely due to data entry errors.
In a real-life scenario, it would be advisable to investigate these values with the business to ensure accuracy.
However, for the purpose of this analysis, we can assume that these values are erroneous and proceed to replace all corresponding recharge amounts with 0.
Before we replace the values, let's see if this recharge amount exists when Phone Service is selected.
# Filter the values where phone service is yes and recharge amt is in 10 or 989
df_call_usage[(df_call_usage['Phone Service']=='Yes')&(df_call_usage['total_rech_amt'].isin([10,989]))] Start coding or generate with AI.
# Plot a boxplot for such values
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage[(df_call_usage['Phone Service']=='Yes')&(df_call_usage['total_rech_amt'].isin([10,989]))])
plt.title("Distribution of Variable for Customers with a Phone Service")
plt.xlabel("Call Analysis Variables")
plt.xticks(rotation=90)
plt.show() Start coding or generate with AI.
There are no outlier values (in the 8e6 range) here.
This gives us more confidence in replacing the 10 and 989 recharge amounts where phone service is not selected.
# Replacing all recharge amounts =0 where there is no phone service
df_call_usage.loc[df_call_usage['Phone Service']=='No','total_rech_amt']=0 Start coding or generate with AI.
Now that we've replaced all of the anomalies with 0, let's look at the graph for all of the data again.
# Plot the the distribution using Box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage['total_rech_amt'])
plt.title("The Distribution of Total Recharge Amount")
plt.xlabel("Total Recharge Amount")
plt.show() Start coding or generate with AI.
(24984, 26)
Observation:
From our previous analysis, we found that more than 25% of customers have a total recharge amount above 1158. Hence, it is not feasible to replace these values. It is possible that these customers have actually made such a high recharge.
Regarding the customers with a recharge amount of 11875, it is worth checking their usage to verify whether this amount is valid or not.
As we previously showed, all of these columns have a maximum value in the range 8E6, which we no longer have.
Let us first see the boxplot for consumers who did not select Phone Service.
# Plot the the distribution using Box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage[df_call_usage['Phone Service']=='No'])
plt.title("The Distribution of Outgoing Calls Variables for Customers with no Phone Service")
plt.xlabel("Outgoing Calls Variables")
plt.xticks(rotation=90)
plt.show() Start coding or generate with AI.
# Plot the the distribution using Box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage[(df_call_usage['Phone Service']=='No')&(df_call_usage['total_rech_amt']==0)][outgoing_call_col])
plt.title("The Distribution of Outgoing Calls Variables for Customers with no Phone Service and Recharge Amount")
plt.xlabel("Outgoing Calls Variables")
plt.xticks(rotation=90)
plt.show() Start coding or generate with AI.
Customers who did not select Phone Service and whose total recharge amount is 0 have values of either 0 or roughly 8.8e6.
Let's look at the boxplot for customers that chose Phone Service.
# Plot the the distribution using Box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage[df_call_usage['Phone Service']=='Yes'][outgoing_call_col])
plt.title("The Distribution of Outgoing Calls Variables for Customers with Phone Service")
plt.xlabel("Outgoing Calls Variables")
plt.xticks(rotation=90)
plt.show() Start coding or generate with AI.
From the two boxplots above we can conclude that the abnormal outgoing call values occur only for customers who didn't opt for Phone Service. Let's analyze those further.
# Analyze customers with and without phone service separately
df_call_usage_phone_no=df_call_usage[df_call_usage['Phone Service']=='No'].copy()
df_call_usage_phone_yes=df_call_usage[df_call_usage['Phone Service']=='Yes'].copy() Start coding or generate with AI.
# Columns
df_call_usage_phone_no.columns Start coding or generate with AI.
Index(['Month of Joining', 'roam_ic', 'roam_og', 'loc_og_t2t', 'loc_og_t2m',
'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m', 'std_og_t2f',
'std_og_t2c', 'isd_og', 'spl_og', 'og_others', 'loc_ic_t2t',
'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m', 'std_ic_t2f',
'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others', 'total_rech_amt',
'Phone Service'],
dtype='object')
# Roaming Outgoing Value Counts
df_call_usage_phone_no['roam_og'].value_counts() Start coding or generate with AI.
Observations:
Take note that the values 8787978 and 46254 are abnormal; a person cannot spend that many minutes on outbound calls in a month.
46254 minutes is approximately 32 days, which is more than a month.
As a result, we will replace both values with 0.
# Replacing these values
df_call_usage_phone_no['roam_og']=df_call_usage_phone_no['roam_og'].replace([46254,8787978],0) Start coding or generate with AI.
# Checking the roaming outgoing value counts
df_call_usage_phone_no['roam_og'].value_counts() Start coding or generate with AI.
# Check where the recharge amount is 0 but we have roaming calls
df_call_usage_phone_no[df_call_usage_phone_no['total_rech_amt']==0]['roam_og'].value_counts() Start coding or generate with AI.
We took all of the customers that did not opt for phone service, and the recharge amount is 0.
We will replace all of these values with 0 because they cannot make outbound calls without a recharge and a Phone Service plan.
# Replace the values with 0
df_call_usage_phone_no.loc[df_call_usage_phone_no['total_rech_amt']==0,'roam_og']=0 Start coding or generate with AI.
# 0 roaming calls for customers with no phone service
df_call_usage_phone_no['roam_og'].value_counts() Start coding or generate with AI.
# Plot the the distribution using Box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage_phone_yes['roam_og'])
plt.title("The Distribution of Roaming Outgoing Calls for Customers with Phone Service")
plt.xlabel("Roaming Outgoing Calls")
plt.show()
# Check the quantiles
df_call_usage_phone_yes['roam_og'].quantile([0.1, 0.5,0.75 ,0.9,0.97,0.99])
These values seem to be valid. We will now determine the recharge amounts for these customers.
# Plot the distribution using a histogram
plt.figure(figsize=(10,6))
sns.histplot(data=df_call_usage_phone_yes, x='total_rech_amt', bins=20)
plt.title("The Distribution of Total Recharge Amount for Customers with Phone Service")
plt.xlabel("Total Recharge Amount")
plt.show()
df_call_usage_phone_yes[df_call_usage_phone_yes['total_rech_amt']==0].shape
(262, 26)
It should be noted that these 262 customers (not all unique) have chosen phone service but have no recharge amount recorded.
# Plot the distribution using a box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage_phone_yes[df_call_usage_phone_yes['total_rech_amt']==0])
plt.title("The Distribution of Outgoing Call Variables for Customers with Phone Service but no recharge amount")
plt.xlabel("Outgoing Call Variables")
plt.xticks(rotation=90)
plt.show()
We will not impute any other value for total_rech_amt for these customers. It is possible that the company has given them some offer, or that they recharged but it is not reflected in the data.
#### **Local outgoing calls within same network in minutes**
# Check the counts
df_call_usage_phone_no['loc_og_t2t'].value_counts()
We will replace the values 46254 and 8787978 in all outgoing call columns.
for col in outgoing_call_col:
    # Replace the 46254 and 8787978 values with 0
    df_call_usage_phone_no[col]=df_call_usage_phone_no[col].replace([46254,8787978],0)
    # Replace the value with 0 when there is no phone service and no recharge
    df_call_usage_phone_no.loc[df_call_usage_phone_no['total_rech_amt']==0,col]=0
# Value counts for all outgoing call variables for customers with no phone service
for col in outgoing_call_col:
    print("value count for ",col)
    print(df_call_usage_phone_no[col].value_counts())
    print("-------------------------")
Output (abridged): every outgoing call column (roam_og, loc_og_t2t, loc_og_t2m, loc_og_t2f, loc_og_t2c, std_og_t2t, std_og_t2m, std_og_t2f, std_og_t2c, isd_og, spl_og, og_others) now contains only the value 0.0 for all 124570 rows.
Observations:
We observed that some customers who did not opt for phone service and have not recharged their account also have not placed any outgoing calls. This makes sense as they are not subscribed to the service and hence cannot make any calls.
To further analyze the outgoing call patterns, we will focus on customers who have opted for phone service.
# Creating quantile df for outgoing calls
quantile_outgoing_call_df=pd.DataFrame(columns=outgoing_call_col,index=[0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
# Entering data in quantile df
for col in outgoing_call_col:
    quantile_outgoing_call_df[col]=df_call_usage_phone_yes[col].quantile([0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
### **Outgoing call Quantiles for customers with Phone Service**
# Quantile df
quantile_outgoing_call_df
All the values in the outgoing call columns for customers with phone service seem to be appropriate and valid. Therefore, we do not need to make any alterations to these values.
Next, we will update the df_call_usage_phone_no and df_call_usage_phone_yes dataframes to the original dataframe.
# Updating the call usage df
df_call_usage.update(df_call_usage_phone_no)
df_call_usage.update(df_call_usage_phone_yes)
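For clarity on what `update` does here: it aligns on index and columns and overwrites the caller's values with non-NA values from the other frame. A minimal illustration on toy data (values invented):

```python
import pandas as pd

# update() aligns on index/columns and overwrites the caller's values
# with non-NA values from the other frame; NaNs in the patch are ignored.
base = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
patch = pd.DataFrame({'a': [99.0, None]}, index=[0, 2])

base.update(patch)
print(base)
# 'a' at index 0 becomes 99.0; the NaN in patch leaves index 2 unchanged
```

This is why updating `df_call_usage` from the two filtered sub-frames only touches the rows each sub-frame actually contains.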
# Incoming call analysis
# Creating incoming calls variables column list
incoming_call_col=['roam_ic','loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t',
'std_ic_t2m', 'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic',
'ic_others']
It should be noted that even if a customer does not have phone service, he or she may receive an incoming call. So we'll just look at quantiles and boxplots for all of the columns.
# Creating quantile df for incoming calls
quantile_incoming_call_df=pd.DataFrame(columns=incoming_call_col,index=[0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
# Entering incoming calls data into the quantile df
for col in quantile_incoming_call_df:
    quantile_incoming_call_df[col]=df_call_usage[col].quantile([0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
quantile_incoming_call_df
Except for roam_ic, all other columns appear to have appropriate values. We will not change them.
Let us analyze the roam_ic column further.
# Counts of unique values in roam_ic column
df_call_usage['roam_ic'].value_counts()
We can also observe the 46254 and 8787978 figures here. Let us substitute 0 for them.
The values 45 and 68 are also present, but we will not update them because they may be true values rather than data entry errors.
# Replacing Outliers with 0
df_call_usage['roam_ic']=df_call_usage['roam_ic'].replace([46254,8787978],0)
# Plot the distribution using a box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_call_usage['roam_ic'])
plt.title("The Distribution of Incoming Roaming Calls")
plt.xlabel("Incoming Roaming Calls")
plt.show()
# Internet Services
# Check customer data usage
df_data_usage=df[['Month of Joining',
'total_rech_data', 'vol_4g', 'vol_5g',
'Internet Service']]
# Data types involved
df_data_usage.dtypes
Observe that total_rech_data is of type object, but according to the data dictionary it should be float.
# Check the value counts
df_data_usage['Internet Service'].value_counts()
# Total recharge data value counts
df_data_usage['total_rech_data'].value_counts()
We can see there is an 'e01' value in the column, likely a garbled scientific-notation entry for 10. Let's replace it with 10.
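Before hard-coding such a replacement, one way to surface every non-numeric entry in a column is `pd.to_numeric` with `errors='coerce'`. A toy sketch (invented data, not the project dataset):

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a stray garbled entry
s = pd.Series(['10', '5', 'e01', '3'])

# Coercing turns anything unparseable into NaN, which flags it for review
coerced = pd.to_numeric(s, errors='coerce')
bad = s[coerced.isna()]
print(bad.tolist())  # ['e01']
```

This makes it easy to confirm 'e01' is the only non-numeric value before replacing it.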
# Replacing the values
df_data_usage['total_rech_data']=df_data_usage['total_rech_data'].replace('e01',10)
# Now we will convert the total rech data into a numeric column
df_data_usage['total_rech_data']=pd.to_numeric(df_data_usage['total_rech_data'])
# Plot the distribution using a box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_data_usage['total_rech_data'])
plt.title("The Distribution of Total Recharge Data")
plt.xlabel("Total Recharge Data")
plt.show()
Observe that there are some outliers.
# Check the quantiles
df_data_usage['total_rech_data'].quantile([0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
Let's see how many customers have a data recharge of more than $31.
# Filtering the data
df_data_usage[df_data_usage['total_rech_data']>31]
We can observe that many of them have not chosen Internet Service. Let's look at the unique values of total_rech_data where customers have not chosen Internet Service.
# Check the total recharge data unique value counts where in there is no internet service
df_data_usage[df_data_usage['Internet Service']=='No']['total_rech_data'].value_counts()
The 10 and 1.6 data recharge amounts appear to be correct. We need to analyze further and decide whether to replace the values or remove the No from the Internet service.
89890 and 7987 appear to be absurd values. Because there is no Internet Service Plan, we can replace them with 0.
# Replacing absurd values with 0
df_data_usage['total_rech_data']=df_data_usage['total_rech_data'].replace([7987,89890],0)
Let's look at 10 and 1.6 now.
# Checking volumes of 4g and 5g data where internet service is no
df_data_usage[(df_data_usage['Internet Service']=='No')&(df_data_usage['total_rech_data'].isin([10,1.6]))][['vol_4g','vol_5g']].apply(lambda col: col.unique())
# Checking volumes of 4g and 5g data where internet service is yes
df_data_usage[(df_data_usage['Internet Service']=='Yes')&(df_data_usage['total_rech_data'].isin([10,1.6]))][['vol_4g','vol_5g']].apply(lambda col: col.unique())
Observations:
We can clearly observe that whenever a data recharge of 10 or 1.6 is made and Internet service is selected, vol_4g and vol_5g take varying values.
But when Internet service is not opted for, the values are limited to 0, 63, 254687, 87978, or 789.
We could check with the business team to clarify these cases.
# Replacing the values
df_data_usage.loc[df_data_usage['Internet Service']=='No','total_rech_data']=0
For customers who did not choose Internet service, we replaced total_rech_data with 0.
# Plot the distribution using a box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_data_usage['total_rech_data'])
plt.title("The Distribution of Total Recharge Data")
plt.xlabel("Total Recharge Data")
plt.show()
There are some customers who are above the 75th percentile. We are not going to change those values.
We already discussed imputing arpu 4g and arpu 5g after analyzing this variable, so let's do it now.
# Check the value counts for total recharge data where arpu 4g and 5g is null
df[(df['arpu_4g'].isna())&(df['arpu_5g'].isna())]['total_rech_data'].value_counts(dropna=False)
# Replacing the values with Not Applicable
df.loc[(df['arpu_4g'].isna())&(df['arpu_5g'].isna())&(df['total_rech_data'].isna()),['arpu_4g','arpu_5g']]='Not Applicable'
### **vol_4g and vol_5g columns**
# Separately analyzing the internet volumes for customers with and without internet service
df_data_usage_internet_no=df_data_usage[df_data_usage['Internet Service']=='No']
df_data_usage_internet_yes=df_data_usage[df_data_usage['Internet Service']=='Yes']
# Plot the distribution using a box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_data_usage_internet_yes['vol_4g'])
plt.title("The Distribution of 4g Volume for Customers with Internet Service")
plt.xlabel("4G Internet Volume")
plt.show()
# Check the quantiles
df_data_usage_internet_yes['vol_4g'].quantile([0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
We will not alter any of these values.
# Plot the distribution using a box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_data_usage_internet_no['vol_4g'])
plt.title("The Distribution of 4g Volume for Customers without Internet Service")
plt.xlabel("4G Internet Volume")
plt.show()
The abnormal values occur when Internet service is not opted for.
# Value counts of 4g data without internet service
df_data_usage_internet_no['vol_4g'].value_counts()
Let's see how many of these values occur without any data recharge.
# Count the values in 4G volume with no internet service and recharge data
df_data_usage_internet_no[df_data_usage_internet_no['total_rech_data']==0]['vol_4g'].value_counts()
Since vol_4g is measured in GB, the values 254687 and 87978 appear to be absurd. Let us substitute 0 for them.
We earlier saw that the customers had recharged with 10, 1.6, or other absurd quantities, and we replaced all data recharge amounts with 0.
Similarly, we will replace all of the values here.
# Replacing the value with 0
df_data_usage_internet_no.loc[df_data_usage_internet_no['total_rech_data']==0,'vol_4g']=0
# Checking the values of 4g volume without internet service
df_data_usage_internet_no['vol_4g'].value_counts()
# Plot the distribution using a box plot
plt.figure(figsize=(10,6))
sns.boxplot(df_data_usage_internet_yes['vol_5g'])
plt.title("The Distribution of 5g Volume for Customers with Internet Service")
plt.xlabel("5G Internet Volume")
plt.show()
# Check the quantiles
df_data_usage_internet_yes['vol_5g'].quantile([0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
# Check the 5g volume value counts for customers with no internet service
df_data_usage_internet_no['vol_5g'].value_counts()
Again, the abnormal values occur when Internet service is not opted for.
# Value counts of 5g volume without internet service
df_data_usage_internet_no[df_data_usage_internet_no['total_rech_data']==0]['vol_5g'].value_counts()
Observations: the same absurd values appear for vol_5g, so we replace them with 0 as well.
# Replacing values with 0
df_data_usage_internet_no.loc[df_data_usage_internet_no['total_rech_data']==0,'vol_5g']=0
# Check the counts again
df_data_usage_internet_no['vol_5g'].value_counts()
# update the dataframes
df_data_usage.update(df_data_usage_internet_no)
df_data_usage.update(df_data_usage_internet_yes)
We have updated all of the df data usage and df call usage values.
We can update both of them in our main dataframe df.
# Update original dataframe
df.update(df_call_usage)
df.update(df_data_usage)
Let's now check % of missing values in the data!
# Creating a missing value df with the null values of our original dataframe
percent_missing = df.isna().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
'percent_missing': percent_missing.values})
# Percentage of missing values in each column
missing_value_df
# Missing value columns with > 20 % missing values
missing_value_df[missing_value_df.percent_missing>20]
Missing values greater than 20% should not be imputed because the imputation process may introduce bias and distort the true distribution of the data. Imputing a large percentage of missing values may also result in unreliable and inaccurate analysis. In such cases, it may be better to drop the columns!
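The drop step described above can be sketched on toy data (column names and values invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame: column 'b' is 40% missing, column 'a' only 20%
toy = pd.DataFrame({'a': [1, 2, 3, 4, np.nan],
                    'b': [1, np.nan, np.nan, 4, 5.0]})

# Percentage of missing values per column
pct_missing = toy.isna().mean() * 100

# Drop any column above the 20% threshold
to_drop = pct_missing[pct_missing > 20].index
toy_reduced = toy.drop(columns=to_drop)
print(toy_reduced.columns.tolist())  # only 'a' survives
```

The same pattern applies to `missing_value_df` computed above, with the threshold kept at 20%.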
Let's select all the columns with fewer missing values and impute them.
# Select columns with between 1% and 20% missing values
missing_impute_cols=list(missing_value_df[(missing_value_df.percent_missing>1)&(missing_value_df.percent_missing<20)]['column_name'])
# All columns selected for imputation (1-20% missing values)
missing_impute_cols
['roam_ic', 'roam_og', 'loc_og_t2t', 'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m', 'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others', 'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m', 'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others', 'arpu_5g', 'arpu_4g', 'aug_vbc_5g']
# Shape
df[missing_impute_cols].shape
(653753, 26)
# **Multivariate Imputation by Chained Equations (MICE)**
# Imputing with MICE
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import linear_model
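Before applying it to the real columns, here is a minimal self-contained sketch of what IterativeImputer does, on toy data invented for illustration: each feature with missing values is modeled as a function of the other features, and the missing cells are filled with the model's predictions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn import linear_model

# Toy data where y is roughly 2*x, with one missing cell in y
toy = pd.DataFrame({'x': [1.0, 2.0, 4.0],
                    'y': [2.0, 4.0, 8.0]})
toy.loc[3] = [3.0, np.nan]

imp = IterativeImputer(estimator=linear_model.BayesianRidge())
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)
print(filled)  # the NaN in 'y' is estimated from its relationship with 'x'
```

The imputed value lands near 6 because the regression learns the approximate y = 2x relationship from the observed rows.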
# Columns with Missing Values
missing_impute_cols
['roam_ic', 'roam_og', 'loc_og_t2t', 'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m', 'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others', 'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m', 'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others', 'arpu_5g', 'arpu_4g', 'aug_vbc_5g']
# Filtering and creating a copy dataframe
df_mice = df.filter(missing_impute_cols, axis=1).copy()
# Top three rows
df_mice.head(3)
# Data types involved
df_mice.dtypes
# Excluding arpu_4g and arpu_5g from MICE for now (they contain 'Not Applicable' strings)
missing_impute_cols.remove('arpu_4g')
missing_impute_cols.remove('arpu_5g')
# Creating the copy dataframe for mice imputation
df_mice = df.filter(missing_impute_cols, axis=1).copy()
# Define MICE Imputer and fill missing values
mice_imputer = IterativeImputer(estimator=linear_model.BayesianRidge(), n_nearest_features=None, imputation_order='ascending')
# Fit Transform the data
df_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(df_mice), columns=df_mice.columns,index=df_mice.index)
# Top three rows after imputation
df_mice_imputed.head(3)
# Verifying the imputation
df_mice_imputed.isna().sum()
# Update the original dataframe
df.update(df_mice_imputed)
# Top three rows
df.head(3)
# Let's do imputation for arpu_4g and arpu_5g!
# Select the columns
df_arpu=df[['arpu_4g','arpu_5g']]
# Filtering the 4G ARPU dataframe where values are not "Not Applicable"
df_arpu_filterd=df_arpu[df_arpu.arpu_4g!='Not Applicable']
# Filtering the 5G ARPU dataframe where values are not "Not Applicable"
df_arpu_filterd=df_arpu_filterd[df_arpu_filterd.arpu_5g!='Not Applicable']
# Top three rows
df_arpu_filterd.head(3)
# Fit transform rest of the data for 4G and 5G ARPU
df_arpu_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(df_arpu_filterd), columns=df_arpu_filterd.columns,index=df_arpu_filterd.index)
# Top three rows for imputed data
df_arpu_mice_imputed.head(3)
# Let's update this in the original dataframe!
# Verifying that there are no missing values
df_arpu_mice_imputed.isna().sum()
# Updating the main dataframe
df.update(df_arpu_mice_imputed)
# Verifying that there are no missing values
df[df['arpu_4g'].isna()]
# Verifying that there are no missing values
df.isna().sum()
# Resetting Index
df=df.reset_index(drop=False)
# Bivariate Analysis
# Filling the rest of the missing values with 0
# Replace "Not Applicable" in the ARPU columns with 0
df['arpu_5g'].replace("Not Applicable", 0, inplace=True)
df['arpu_4g'].replace("Not Applicable", 0, inplace=True)
df["total_rech_data"] = df["total_rech_data"].fillna(0)
df['total_rech_amt'] = df['total_rech_amt'].fillna(0)
df['night_pck_user'] = df['night_pck_user'].fillna(0)
df['fb_user'] = df['fb_user'].fillna(0)
# Converting categorical variables back to category type, e.g. Month is stored in numerical format
other_cat_vars = ['Month', 'Month of Joining', 'zip_code', 'latitude', 'longitude','night_pck_user', 'fb_user', 'Churn Value']
for var in other_cat_vars:
    df[var] = df[var].astype('object')
# list of categorical variables
cat_vars = list(df.select_dtypes(include=['object']).columns)
# list of numerical variables
num_vars = list(df.select_dtypes(include=['int64', 'float64']).columns)
len(num_vars)
36
# Month and Churn Value
# Plot the churn by month
plt.figure(figsize=(10,6))
sns.countplot(x="Month", hue="Churn Value", data=df)
plt.title("Countplot of churned vs. non-churned customers for each month")
churn_by_month = df.groupby('Month')['Churn Value'].mean()
print(churn_by_month)
Observation:
We can observe that January has the highest churn rate of 36.44%, followed by February with a churn rate of 16.57%. March has a churn rate of 14.84%, which is slightly lower than February. After that, the churn rate decreases steadily for the remaining months, with the lowest churn rate being in July with only 1.87%.
This indicates that there may be a seasonal pattern in customer churn, with the churn rate highest during the first few months of the year and gradually decreasing afterwards.
# Age and Churn Value
# Put ages into bins
bins = [0, 18, 25, 35, 50, 65, 100]
labels = ['<18', '18-24', '25-34', '35-49', '50-64', '65+']
df['age_bucket'] = pd.cut(df['Age'], bins=bins, labels=labels)
# Plot Age vs Churn Distribution
plt.figure(figsize=(10,6))
sns.histplot(data=df, x='age_bucket', hue='Churn Value', multiple='stack')
plt.title('Distribution of Ages for Churned vs. Non-Churned Customers')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
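One subtlety in the binning above, worth keeping in mind when reading the bucket labels: `pd.cut` builds right-closed intervals by default, so boundary ages fall into the lower bucket. A quick check with the same bins and labels:

```python
import pandas as pd

bins = [0, 18, 25, 35, 50, 65, 100]
labels = ['<18', '18-24', '25-34', '35-49', '50-64', '65+']

# pd.cut intervals are right-closed by default: (0, 18], (18, 25], ...
# so a boundary age such as 18 or 25 lands in the *lower* bucket.
ages = pd.Series([18, 19, 25, 35])
buckets = pd.cut(ages, bins=bins, labels=labels).tolist()
print(buckets)  # ['<18', '18-24', '18-24', '25-34']
```

If the labels are meant to be inclusive of their upper bound (e.g. 25 in '25-34'), `right=False` would be needed instead.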
# Get count of churned and non-churned customers
age_churn_counts = df.groupby(['age_bucket', 'Churn Value'])['Churn Value'].count()
print(age_churn_counts)
age_bucket Churn Value
<18 0 7950
1 349
18-24 0 99723
1 4816
25-34 0 252430
1 12078
35-49 0 177088
1 8367
50-64 0 70362
1 3475
65+ 0 14495
1 705
Name: Churn Value, dtype: int64
# calculate churn rate for each age bucket
age_churn_rates = df.groupby('age_bucket')['Churn Value'].mean()
print(age_churn_rates)
age_bucket
<18      0.042053
18-24    0.046069
25-34    0.045662
35-49    0.045116
50-64    0.047063
65+      0.046382
Name: Churn Value, dtype: float64
Observation:
There doesn't seem to be a significant difference in churn rates across the different age buckets. The churn rate ranges from about 4.2% to 4.7%, with the highest churn rate in the 50-64 age bucket. The differences between buckets are relatively small, with none being significantly higher or lower than the others.
This suggests that age may not be a strong predictor of churn on its own, and that other variables may be more important in determining whether a customer will churn or not.
# Age and ARPU
# Create a box plot to compare the distribution of ARPU for churned vs. non-churned customers in each age bucket
plt.figure(figsize=(10,6))
sns.boxplot(data=df, x='age_bucket', y='arpu', hue='Churn Value')
plt.title('ARPU Distribution for Churned vs. Non-Churned Customers in Each Age Bucket')
plt.xlabel('Age Bucket')
plt.ylabel('ARPU')
plt.show()
# Calculate the mean ARPU for churned vs. non-churned customers
age_arpu = df.groupby(['age_bucket', 'Churn Value'])['arpu'].mean().reset_index()
print(age_arpu)
Observation:
In general, we can see that the average revenue per user (ARPU) is higher for churned customers in all age buckets. This may indicate that customers who spend more on the service are more likely to churn, which is a cause for concern for the company.
We can see that the churn rates for different age groups are similar, but the average revenue per user (ARPU) is higher for older customers who did not churn. This suggests that older customers may be more loyal to the product and less likely to switch to a competitor, but they also value consistency and are less likely to increase their spending on the product.
# Age and Customer Satisfaction Score
# boxplot to visualize the distribution of satisfaction scores for each age bucket
plt.figure(figsize=(10,6))
sns.boxplot(data=df, x='age_bucket', y='Satisfaction Score', hue='Churn Value')
plt.title('Distribution of Satisfaction Scores by Age and Churn')
plt.xlabel('Age Bucket')
plt.ylabel('Satisfaction Score')
plt.show()
# Calculate the mean satisfaction score for each age bucket and churn group
age_satisfaction = df.groupby(['age_bucket', 'Churn Value']).agg({'Satisfaction Score': 'mean'}).reset_index()
# Pivot the table to make the churn values into columns
age_satisfaction = age_satisfaction.pivot(index='age_bucket', columns='Churn Value', values='Satisfaction Score')
print(age_satisfaction)
Observation:
Overall, the mean satisfaction scores are relatively similar across all age buckets, ranging from 3.18 to 3.19.
Let's see the most frequent reason for churn in each age group.
# Create a pivot table with churn reasons as columns and age buckets as rows
pivot_table = pd.pivot_table(df, index='age_bucket', columns='Churn Reason', values='Churn Value', aggfunc='sum')
# Remove the "Unknown" category from the pivot table
age_pivot = pivot_table[pivot_table.columns.drop("Unknown")]
# Get the top three most frequent reasons for churn in each age bucket
top_three = age_pivot.apply(lambda x: x.nlargest(3).index.tolist(), axis=1).to_frame(name='Top Three Reasons')
# Let's look into the pivot table
age_pivot
# top three reasons
top_three = top_three.reset_index().drop(0).reset_index(drop=True)
top_three
# Print the top reasons people churned in each age bucket
for i, row in top_three.reset_index().iterrows():
    print(row['age_bucket'], row['Top Three Reasons'])
18-24 ['Attitude of support person', 'Competitor offered more data', 'Lack of self-service on Website']
25-34 ['Attitude of support person', 'Lack of self-service on Website', 'Service dissatisfaction']
35-49 ['Lack of self-service on Website', 'Competitor had better devices', 'Moved']
50-64 ['Competitor had better devices', 'Service dissatisfaction', 'Moved']
65+ ["Don't know", 'Attitude of support person', 'Competitor offered more data']
Observations:
The most common churn reasons are related to dissatisfaction with the service provided by the company, followed by competition from other service providers for customers across all age brackets.
Attitude of support person was one of the top reasons for churn across all age groups. The company could invest in training their support staff to provide better customer service, and focus on addressing customer complaints professionally and effectively.
Competitors offering better devices (35-49 and 50-64 age groups) or more data (18-24 and 65+ age groups) were common reasons for churn. The company could consider offering competitive deals and promotions to retain customers and attract new ones.
Lack of self-service on the website was a common reason for churn in the 18-24 and 25-34 age groups, and was also a contributing factor in the 35-49 age group. The company could work on improving their website and mobile app to offer more self-service options, such as account management and bill payment, to make it easier for customers to manage their accounts.
# Negative ARPU
Univariate Data Analysis Observation:
The fact that 17% of the people have negative arpu means that a significant proportion of the customer base is not generating revenue for the telecom company.
This could be due to a number of reasons, such as inactive accounts, delinquent accounts, or accounts with outstanding balances.
To understand more about this issue, we can investigate other variables that may be related to negative arpu, such as churn status. We can also look at the distribution of negative arpu values across different demographic groups, such as age, gender, and location, to see if there are any patterns or trends.
Additionally, we can explore whether there are any correlations between negative arpu and other variables, such as usage patterns, service subscriptions, and customer satisfaction scores to understand what's causing this behaviour.
Overall, further investigation is needed to fully understand the implications of negative arpu for the telecom company.
# calculate the percentage of customers with negative ARPU
negative_arpu_percentage = (df[df['arpu'] < 0]['arpu'].count() / df.shape[0]) * 100
# create a bar plot to visualize the negative ARPU percentage
plt.figure(figsize=(10,6))
plt.bar(['Negative ARPU', 'Positive ARPU'], [negative_arpu_percentage, 100 - negative_arpu_percentage])
plt.title('Percentage of Customers with Negative ARPU')
plt.xlabel('ARPU')
plt.ylabel('Percentage')
plt.show()
# the relationship between negative ARPU and churn status
churn_negative_arpu = df[df['arpu'] < 0]['Churn Value'].value_counts(normalize=True)
# create a pie chart to visualize the churn status of customers with negative ARPU
plt.figure(figsize=(10,6))
plt.pie(churn_negative_arpu, labels=churn_negative_arpu.index, autopct='%1.1f%%')
plt.title('Churn Status of Customers with Negative ARPU')
plt.show()
Observation:
95.2% of customers with negative ARPU did not churn, while only 4.8% of customers with negative ARPU churned.
This suggests that having negative ARPU is not necessarily a strong predictor of churn. However, further analysis is needed to fully understand the relationship between negative ARPU and churn. Other factors, such as usage patterns, service subscriptions, and customer satisfaction, may also play a role in predicting churn.
It's possible that the negative ARPU is a result of money spent by the company to retain customers but negative ARPU could also be a result of other factors, such as accounts that have been inactive for a long time or delinquent accounts with outstanding balances.
Additionally, the 4.8% of customers with negative ARPU who churned may have done so for reasons such as finding a better deal elsewhere.
# Let's analyze further
negative_arpu_churn = df[(df['arpu'] < 0) & (df['Churn Value'] == 1)]
print("Number of customers with negative ARPU who churned: ", len(negative_arpu_churn['Customer ID']))
Number of customers with negative ARPU who churned: 5610
# count plot of satisfaction scores for churned negative-ARPU customers
# (every row in negative_arpu_churn has Churn Value == 1, so no hue is needed)
plt.figure(figsize=(10,6))
sns.countplot(x='Satisfaction Score', data=negative_arpu_churn)
# add labels and title
plt.xlabel('Satisfaction Score')
plt.ylabel('Count')
plt.title('Satisfaction Score of Churned Customers with Negative ARPU')
plt.show()
# calculate the average customer satisfaction score for this group
avg_satisfaction = negative_arpu_churn['Satisfaction Score'].mean()
print("Average customer satisfaction score:", round(avg_satisfaction,2))
Average customer satisfaction score: 2.0
Observation:
The average customer satisfaction score for churned customers with negative ARPU is 2, which suggests that these customers were not satisfied with the services provided by the telecom company.
It could mean that the company needs to improve its services and support to retain its customers.
# count the number of customers for each churn reason
churn_reason_counts = negative_arpu_churn['Churn Reason'].value_counts()
# plot the churn reason counts
plt.figure(figsize=(16,6))
sns.barplot(x=churn_reason_counts.index, y=churn_reason_counts)
plt.title('Churn Reason for Customers with Negative ARPU and Churned')
plt.xlabel('Churn Reason')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
Observation:
The distribution of churn reasons is quite similar, which makes it difficult to draw clear conclusions. However, there are a few observations we can make.
The most common reason for churn is "Unknown," which could suggest that the company needs to do a better job of tracking and understanding customer behavior and reasons for leaving.
Other common reasons for churn include price, service dissatisfaction, and network reliability, which could indicate that the company needs to focus on improving the quality of its service and offering competitive pricing to retain customers.
The fact that many customers cited competitor offerings as a reason for leaving suggests that the company may need to do a better job of differentiating itself in the market.
The presence of unusual reasons such as "43tgeh" with only 16 counts could indicate data entry errors or outliers that should be investigated and corrected.
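Unusual categories like this can be flagged programmatically. A minimal sketch (on hypothetical counts, not the project's data) that marks any churn reason covering less than 5% of records as a candidate for review:

```python
import pandas as pd

# Hypothetical churn-reason records, including a garbled entry like "43tgeh"
reasons = pd.Series(["Price"] * 40 + ["Network"] * 35 + ["Support"] * 30 + ["43tgeh"] * 2)
counts = reasons.value_counts()

# Flag categories that cover less than 5% of records as likely data-entry errors
suspicious = counts[counts / counts.sum() < 0.05].index.tolist()
print(suspicious)
```

The same filter applied to `negative_arpu_churn['Churn Reason']` would surface entries worth manual investigation before modeling.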
### **ARPU and Customer Demographics**
# Plot arpu vs gender
plt.figure(figsize=(10,6))
sns.boxplot(x='Gender', y='arpu', data=df)
plt.title('ARPU by Gender')
plt.xlabel('Gender')
plt.ylabel('ARPU')
plt.show()
Observation:
# Plot arpu vs married
plt.figure(figsize=(10,6))
sns.boxplot(x='Married', y='arpu', data=df)
plt.title('ARPU by Marital Status')
plt.xlabel('Married')
plt.ylabel('ARPU')
plt.show()
### **ARPU and Churn Value**
# plot arpu vs churn
plt.figure(figsize=(10,6))
sns.boxplot(x='Churn Value', y='arpu', data=df)
plt.title('ARPU for Churned and Non-Churned Customers')
plt.xlabel('Churn Value')
plt.ylabel('ARPU')
plt.show()
### **ARPU 4G**
# Box plot for ARPU 4G against churn
plt.figure(figsize=(10,6))
sns.boxplot(x='Churn Value', y='arpu_4g', data=df)
plt.title('ARPU 4G vs Churn')
plt.show()
# Scatter plot for ARPU 4G against total data usage
plt.figure(figsize=(10,6))
sns.scatterplot(x='vol_4g', y='arpu_4g', data=df)
plt.title('ARPU 4G vs Total Data Usage on 4G network')
plt.show()
Observations:
Based on the scatter plot of ARPU on the 4G network versus total data usage on the 4G network, we can observe that there are some outliers in the upper left corner of the plot. These outliers indicate that some customers have high ARPU despite low total data usage on the 4G network.
One possible explanation for this observation could be that these customers are using other services besides data, such as voice and messaging, which are not measured in this analysis. Additionally, they could be subscribing to premium services, which are charged at a higher rate than regular data usage.
Another possible explanation could be that these customers have a fixed data plan, meaning that they are charged a fixed amount of money regardless of their actual data usage. In this case, their high ARPU could be due to additional charges for services other than data.
On the other hand, we can also observe that there are some customers with high total data usage on the 4G network but low ARPU. This could be due to various reasons such as low usage of other services, opting for cheaper data plans, or using other networks for services that are not provided on the 4G network.
### **Night Pack User**
# Box plot for Night pack user and ARPU
plt.figure(figsize=(10,6))
sns.boxplot(x='night_pck_user', y='arpu', data=df)
plt.title('Distribution of ARPU for Night Pack Users')
plt.show()
We often need to compare the distribution of a categorical variable in one sample with its distribution in another sample.
The chi-squared test of independence, also known as the chi-squared test of association, is used to determine whether two categorical variables are related.
The data must meet the following requirements for this test:
- Both variables are categorical.
- Observations are independent of one another.
- The cells of the contingency table are mutually exclusive.
- The expected count in each cell is at least 5.
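The expected-count requirement can be checked directly, since `scipy.stats.chi2_contingency` returns the expected frequencies. A sketch on a toy 2x2 table (hypothetical counts):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy 2x2 table (hypothetical counts): rows = group yes/no, cols = churn yes/no
table = np.array([[120, 380], [90, 410]])
chi2, p, dof, expected = chi2_contingency(table)

# The test is considered reliable when every expected cell count is at least 5
print(expected.min() >= 5, dof)
```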
# Categorical variables
cat_vars_for_test = [
'Gender',
'Married',
'Dependents',
'night_pck_user',
'fb_user',
'Referred a Friend',
'Phone Service',
'Multiple Lines',
'Internet Service',
'Internet Type',
'Online Security',
'Online Backup',
'Device Protection Plan',
'Premium Tech Support',
'Streaming TV',
'Streaming Movies',
'Streaming Music',
'Unlimited Data',
'Churn Category',
'offer']
# Chi-square tests
from scipy.stats import chi2_contingency

for var in cat_vars_for_test:
    # Contingency table of the full categorical variable against churn
    # (testing the whole crosstab covers every level of the variable at once)
    chi2, p_val, dof, expected = chi2_contingency(pd.crosstab(df[var], df['Churn Value']))
    if p_val < 0.05:
        print("---------------------------------")
        print(f'Statistically significant association between {var} and Churn')
        print('P-Value:', p_val)
--------------------------------- Statistically significant association between Married and Churn P-Value: 0.040058331482657236 --------------------------------- Statistically significant association between night_pck_user and Churn P-Value: 0.0 --------------------------------- Statistically significant association between fb_user and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Referred a Friend and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Phone Service and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Internet Service and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Online Security and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Online Backup and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Device Protection Plan and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Premium Tech Support and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Streaming TV and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Streaming Movies and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Streaming Music and Churn P-Value: 0.0 --------------------------------- Statistically significant association between Churn Category and Churn P-Value: 0.0 --------------------------------- Statistically significant association between offer and Churn P-Value: 1.7025679102311887e-171
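The p-values above only flag whether an association exists, not how strong it is. Cramér's V, an effect-size measure not computed in this notebook, can be used to rank the associations; a sketch on a toy table:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V for a contingency table: sqrt(chi2 / (n * (min(r, c) - 1)))."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Toy 2x2 table (hypothetical counts)
table = np.array([[30, 70], [50, 50]])
print(round(cramers_v(table), 4))  # small-to-moderate association
```

Applying `cramers_v` to each `pd.crosstab(df[var], df['Churn Value'])` would show which of the significant variables are actually strongly associated with churn.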
ANOVA, or Analysis of Variance, is a statistical method used to test whether there are significant differences between the means of two or more groups. It is a generalization of t-tests, which can only be used to compare two groups, and is commonly used in experimental design and analysis.
In order to perform ANOVA, you need at least one continuous variable and one categorical variable that separates your data into comparison groups. The test compares the variance between group means to the variance within the groups, which allows you to determine whether the groups come from a single population or from separate populations with different means.
The null hypothesis for ANOVA is that all groups have the same mean, while the alternative hypothesis is that at least one group has a different mean. The test calculates an F-statistic, the ratio of between-group variance to within-group variance, and a corresponding p-value.
It is important to note that ANOVA only tells you that there is a difference between groups, but not where the difference is.
Overall, ANOVA is a powerful statistical tool that can help you understand whether there are significant differences between the means of different groups. It can be used in a wide range of fields, including biology, psychology, and engineering, to help researchers gain insights into their data and make informed decisions.
from scipy.stats import f_oneway

# Loop over each numeric variable and test its association with churn using ANOVA
for var in num_vars:
    churn_yes = df[df['Churn Value'] == 1][var]
    churn_no = df[df['Churn Value'] == 0][var]
    f_statistic, p_value = f_oneway(churn_yes, churn_no)
    if p_value < 0.05:
        print('Statistically Significant')
        print('Variable:', var)
        print('Churn Yes Mean:', churn_yes.mean())
        print('Churn No Mean:', churn_no.mean())
        print('F-Statistic:', f_statistic)
        print('P-Value:', p_value)
        print('------------------')
Statistically Significant Variable: arpu Churn Yes Mean: 893.4254240750041 Churn No Mean: 775.823964429513 F-Statistic: 120.70203292065581 P-Value: 4.465829505159502e-28 ------------------ Statistically Significant Variable: roam_ic Churn Yes Mean: 350.16563614641626 Churn No Mean: 244.8377824879824 F-Statistic: 1758.3844171984638 P-Value: 0.0 ------------------ Statistically Significant Variable: roam_og Churn Yes Mean: 289.38338317933994 Churn No Mean: 266.5100433968033 F-Statistic: 38.100583561457384 P-Value: 6.722927867842586e-10 ------------------ Statistically Significant Variable: loc_og_t2t Churn Yes Mean: 1377.5606883197397 Churn No Mean: 808.6374377877157 F-Statistic: 3671.314985696175 P-Value: 0.0 ------------------ Statistically Significant Variable: loc_og_t2m Churn Yes Mean: 904.6001256618706 Churn No Mean: 667.7330641313714 F-Statistic: 1294.5870132818925 P-Value: 3.2206479893713164e-283 ------------------ Statistically Significant Variable: loc_og_t2f Churn Yes Mean: 65.22935893728761 Churn No Mean: 30.691714258822163 F-Statistic: 9908.714968214927 P-Value: 0.0 ------------------ Statistically Significant Variable: loc_og_t2c Churn Yes Mean: 35.2145591177041 Churn No Mean: 30.094724212289467 F-Statistic: 163.47107651119447 P-Value: 1.9943462853736474e-37 ------------------ Statistically Significant Variable: std_og_t2t Churn Yes Mean: 758.2018267843406 Churn No Mean: 568.9617944243469 F-Statistic: 598.1317159385686 P-Value: 4.8949158805886015e-132 ------------------ Statistically Significant Variable: std_og_t2m Churn Yes Mean: 592.8742377838105 Churn No Mean: 433.8048608197123 F-Statistic: 736.3097719658266 P-Value: 4.681690440668445e-162 ------------------ Statistically Significant Variable: std_og_t2f Churn Yes Mean: 49.290821476134575 Churn No Mean: 34.18853820207627 F-Statistic: 1943.3836699760814 P-Value: 0.0 ------------------ Statistically Significant Variable: isd_og Churn Yes Mean: 64.1541795368837 Churn No Mean: 48.91146110634266 
F-Statistic: 516.7523100047864 P-Value: 2.385725154265906e-114 ------------------ Statistically Significant Variable: spl_og Churn Yes Mean: 163.26064856063107 Churn No Mean: 85.408631720455 F-Statistic: 6126.457560903925 P-Value: 0.0 ------------------ Statistically Significant Variable: og_others Churn Yes Mean: 142.13263232537466 Churn No Mean: 96.20278469873027 F-Statistic: 2304.0674064190443 P-Value: 0.0 ------------------ Statistically Significant Variable: loc_ic_t2t Churn Yes Mean: 1174.6660079131425 Churn No Mean: 831.0468103925479 F-Statistic: 2282.9727023467954 P-Value: 0.0 ------------------ Statistically Significant Variable: loc_ic_t2m Churn Yes Mean: 1049.6813997396084 Churn No Mean: 705.0036822623703 F-Statistic: 3338.249851419252 P-Value: 0.0 ------------------ Statistically Significant Variable: loc_ic_t2f Churn Yes Mean: 458.22352482376544 Churn No Mean: 324.8259761135374 F-Statistic: 2178.954514371851 P-Value: 0.0 ------------------ Statistically Significant Variable: std_ic_t2t Churn Yes Mean: 639.3684766619502 Churn No Mean: 356.51316467619193 F-Statistic: 5773.850099667962 P-Value: 0.0 ------------------ Statistically Significant Variable: std_ic_t2m Churn Yes Mean: 417.34890408521306 Churn No Mean: 301.0685030521993 F-Statistic: 2018.4795224824138 P-Value: 0.0 ------------------ Statistically Significant Variable: std_ic_t2f Churn Yes Mean: 179.37291350416544 Churn No Mean: 123.1443383266374 F-Statistic: 2619.9399053419847 P-Value: 0.0 ------------------ Statistically Significant Variable: spl_ic Churn Yes Mean: 0.2981508125083786 Churn No Mean: 0.24906393109674982 F-Statistic: 357.4943321682601 P-Value: 1.0387883884703116e-79 ------------------ Statistically Significant Variable: isd_ic Churn Yes Mean: 463.1104070172861 Churn No Mean: 241.1419209795283 F-Statistic: 7256.929877140321 P-Value: 0.0 ------------------ Statistically Significant Variable: ic_others Churn Yes Mean: 185.8148126612339 Churn No Mean: 142.0030830377494 F-Statistic: 
637.2801651255587 P-Value: 1.5243427936498294e-140 ------------------ Statistically Significant Variable: total_rech_amt Churn Yes Mean: 2045.6858195211787 Churn No Mean: 1669.884963326751 F-Statistic: 453.55974233696725 P-Value: 1.311364610623304e-100 ------------------ Statistically Significant Variable: total_rech_data Churn Yes Mean: 4.1900217646074 Churn No Mean: 3.2517887826020053 F-Statistic: 481.62477496793986 P-Value: 1.0348137316319403e-106 ------------------ Statistically Significant Variable: vol_4g Churn Yes Mean: 206.60916390423574 Churn No Mean: 191.40129077654962 F-Statistic: 18.802360751426484 P-Value: 1.4500919978692099e-05 ------------------ Statistically Significant Variable: vol_5g Churn Yes Mean: 2706.0580063619623 Churn No Mean: 2218.471985885287 F-Statistic: 322.76212872582926 P-Value: 3.772769831957074e-72 ------------------ Statistically Significant Variable: aug_vbc_5g Churn Yes Mean: 634.7004497845409 Churn No Mean: 525.5437112451415 F-Statistic: 172.60873964815855 P-Value: 2.015676603961785e-39 ------------------ Statistically Significant Variable: Satisfaction Score Churn Yes Mean: 1.9873430436966348 Churn No Mean: 3.1864918062216296 F-Statistic: 27367.398959997576 P-Value: 0.0 ------------------
### **Churn Value**
# Box plot for Churn value and total recharge amount
plt.figure(figsize=(10,6))
sns.boxplot(x='Churn Value', y='total_rech_amt', data=df)
plt.title('Total recharge amount')
plt.show()
Observation:
It appears that customers who churned had a higher total recharge amount compared to customers who did not churn. This observation suggests that the amount of money that a customer spends on recharges may be a significant factor in their decision to churn.
This observation can provide valuable insights for the business. If customers who churned had a higher total recharge amount, it could mean that they were dissatisfied with the quality or value of the services provided by the company, even though they were spending more money. Therefore, the business may need to investigate and address the reasons behind the dissatisfaction to retain high-value customers.
In addition, the observation that customers who churned had a higher total recharge amount suggests that identifying and targeting high-value customers for retention efforts may be an effective strategy. By focusing on retaining customers who spend more money, the business can potentially improve their overall revenue and profitability.
### **Referred a Friend and Churn**
Hypothesis: Customers who have referred a friend are less likely to churn.
# pivot table
pivot_table = pd.pivot_table(df, values=['Referred a Friend', 'Number of Referrals'], index=['Churn Value'], aggfunc=np.mean)
# Print the pivot table
print(pivot_table)
# Plot the pivot table
plt.figure(figsize=(10,6))
sns.countplot( x='Referred a Friend', data=df, hue='Churn Value')
plt.title('Referred a Friend vs Churn Value')
plt.show()
plt.figure(figsize=(10,6))
sns.barplot(x='Churn Value', y='Number of Referrals', data=df)
plt.title('Number of Referrals vs Churn Value')
plt.show()
# Let's calculate churn rate in both groups!
refer_group = df.groupby("Referred a Friend")
refer_group_churn_rates = refer_group["Churn Value"].mean() * 100
refer_group_churn_rates
Observation:
If customers who have referred others to the company are churning more, it could suggest that the company's referral program is not effective at retaining customers. This may be due to a variety of reasons, such as:
- The referral program may not provide sufficient incentives or rewards to encourage customers to stay.
- Customers who refer others may have higher expectations of the company and its services, and may be more likely to churn if those expectations are not met.
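Whether the gap between the two groups' churn rates is statistically meaningful can be checked with a two-proportion z-test. A sketch with hypothetical counts (the real counts would come from `refer_group` above):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical (churned, total) counts for each referral group
churn_yes, n_yes = 320, 1000   # referred a friend
churn_no, n_no = 270, 1000     # did not refer

# Pooled two-proportion z-test
p_pool = (churn_yes + churn_no) / (n_yes + n_no)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_yes + 1 / n_no))
z = (churn_yes / n_yes - churn_no / n_no) / se
p_val = 2 * norm.sf(abs(z))    # two-sided p-value
print(round(z, 2), round(p_val, 4))
```

A small p-value here would support the claim that referrers really do churn at a different rate, rather than the gap being sampling noise.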
# To make the analysis more thorough, we can combine the Phone Service column with other related columns such as Multiple Lines, Internet Service, and Internet Type. This would give us a better understanding of the customer's overall service usage.
# Cross-tabulate Phone Service with Multiple Lines
pd.crosstab(df['Phone Service'], df['Multiple Lines'], dropna=False)
By analyzing the phone service column, you can identify how many customers have opted for phone services and how it impacts churn rate and other variables. For example, you may find that a large proportion of customers who churned did not have phone services. This could indicate that offering phone services could be an effective way to improve customer retention.
On the other hand, you may find that customers with phone services are more likely to churn. This may indicate that there are issues with the phone services that need to be addressed.
To gain a more detailed understanding of the impact of phone services on customer behavior and churn rate, you can combine the phone service column with other variables such as demographics, usage patterns, and service types. This can help you identify specific customer segments that are more likely to benefit from phone services or may be more sensitive to phone service issues. For example, you may find that older customers are more likely to value phone services, or that customers who use a lot of data may be less likely to need phone services.
Overall, analyzing the phone service column can provide valuable insights into how offering phone services impacts customer behavior and can inform strategies to improve customer retention and satisfaction.
### **Internet Service and Type**
# Count plot between internet service and type
plt.figure(figsize=(10,6))
sns.countplot(x='Internet Service', hue='Internet Type', data=df)
plt.title("Count plot for different Internet Types")
plt.show()
df.groupby(['Internet Service', 'Internet Type'])['Churn Value'].mean()
Observation:
If the churn rates for all types of internet services are similar, it may indicate that the type of internet service does not have a significant impact on churn. However, it is also possible that there are other factors at play, such as pricing or customer service, that are more important drivers of churn. Further analysis, such as comparing the churn rates for different pricing tiers or customer service ratings, may be necessary to better understand the relationship between internet service and churn.
### **Unlimited Data**
# distribution of data users for revenue
df.groupby('Unlimited Data')[['total_rech_amt','total_rech_data','arpu','Churn Value']].mean()
# Plot a bar graph of the Streaming Movies column
plt.figure(figsize=(10,6))
sns.countplot(x='Streaming Movies', data=df, hue="Churn Value")
plt.title("Plot for churn value for different streaming movies")
plt.show()
The company may want to focus on increasing the percentage of customers using streaming movies, either by improving the quality of the service or by offering promotions. Check this against ARPU, churn value, and satisfaction score; if there is a positive effect on any of them, the effort is likely worthwhile.
The company may also want to investigate why such a large percentage of customers do not use streaming movies, understand the concerns or issues that may be preventing them from using the service, and recommend measures to address them.
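One way to run that check is a single groupby comparing mean ARPU, churn rate, and satisfaction across the streaming groups. A sketch on a small synthetic frame (column names assumed to match the project's):

```python
import numpy as np
import pandas as pd

# Synthetic frame mimicking the columns used in this section
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Streaming Movies": rng.choice(["Yes", "No"], size=500),
    "arpu": rng.normal(800, 150, size=500),
    "Churn Value": rng.integers(0, 2, size=500),
    "Satisfaction Score": rng.integers(1, 6, size=500),
})

# Mean ARPU, churn rate, and satisfaction per streaming group
summary = demo.groupby("Streaming Movies")[["arpu", "Churn Value", "Satisfaction Score"]].mean()
print(summary)
```

On the real `df`, large gaps between the "Yes" and "No" rows would indicate which metric the streaming service actually moves.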
Hypothesis: Customers with higher ARPU 5G are more likely to be satisfied with their service and less likely to churn.
Try to experiment with these hypotheses and generate insights.
Hypotheses:
Customers who have subscribed for premium services such as fiber optic internet, streaming movies, and device protection plan are more likely to have a higher satisfaction score.
Customers who have a longer tenure with the company are more likely to have a higher satisfaction score.
## **Multivariate Analysis**
# Relationship between arpu, satisfaction score and churn value
# the three major pillars of service quality
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='Satisfaction Score',y='arpu', hue='Churn Value')
plt.title("Scatter Plot between Satisfaction Score and ARPU")
plt.show()
Observation:
Based on the scatterplot, we can observe that the customers who have churned are mostly concentrated in the top left quadrant, where their satisfaction score is low but their average revenue per user (arpu) is high. This indicates that despite the customers generating high revenue, they are not satisfied with the service provided and hence are likely to churn.
On the other hand, customers who are satisfied with the service and are less likely to churn are mostly concentrated in the bottom right quadrant, where their satisfaction score is high and their arpu is comparatively low.
This information can be used by the telecom company to focus on improving the customer satisfaction score, as it is a key factor in retaining customers. They could analyze the reasons behind the low satisfaction score of high-paying customers and take measures to improve their overall experience. This could include providing better network coverage, offering customized plans, improving customer service, or providing incentives to loyal customers.
# Creating correlation dataframe and generating heatmap
df_corr = df[num_vars].corr()
mask = np.zeros_like(df_corr)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(20,12))
sns.heatmap(df_corr, cmap='coolwarm', annot=True, fmt='.2f', mask=mask)
plt.show()
# Printing highly correlated variables
corr_pairs = []
for i in range(len(df_corr.columns)):
    for j in range(i):
        if abs(df_corr.iloc[i, j]) >= 0.5:
            corr_pairs.append((df_corr.columns[i], df_corr.columns[j]))

if corr_pairs:
    print("Highly correlated variable pairs:")
    for pair in corr_pairs:
        print(pair[0], "and", pair[1])
else:
    print("No highly correlated variable pairs found.")
Highly correlated variable pairs: loc_og_t2f and loc_og_t2t std_og_t2m and loc_og_t2c std_og_t2m and std_og_t2t std_og_t2f and roam_ic std_og_t2f and loc_og_t2m isd_og and std_og_t2t isd_og and std_og_t2m spl_og and loc_og_t2t spl_og and loc_og_t2f og_others and roam_ic og_others and loc_og_t2m og_others and std_og_t2f loc_ic_t2m and roam_ic loc_ic_t2m and std_og_t2f loc_ic_t2m and og_others loc_ic_t2m and loc_ic_t2t loc_ic_t2f and loc_ic_t2m std_ic_t2t and loc_og_t2t std_ic_t2t and spl_og std_ic_t2m and loc_ic_t2m std_ic_t2m and loc_ic_t2f std_ic_t2f and og_others std_ic_t2f and loc_ic_t2t std_ic_t2f and loc_ic_t2m std_ic_t2f and loc_ic_t2f std_ic_t2f and std_ic_t2m spl_ic and std_og_t2t isd_ic and loc_og_t2t isd_ic and loc_og_t2f isd_ic and spl_og isd_ic and std_ic_t2t ic_others and std_og_t2m
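A common follow-up to a list like this is to drop one variable from each highly correlated pair before modeling. A sketch on synthetic data, using the same 0.5 threshold as the check above:

```python
import numpy as np
import pandas as pd

# Synthetic frame where "b" is nearly collinear with "a" and "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 0.9 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= 0.5).any()]
print(to_drop)
```

The same pattern applied to `df_corr` would produce a candidate drop-list for the modeling stage mentioned in the future scope.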
Chatterjee correlation is a coefficient of correlation proposed by Sourav Chatterjee that measures the degree of dependence between two random variables without assuming anything about their distributions. It is a simple and interpretable measure of dependence that ranges between 0 and 1, where 0 indicates independence and 1 indicates that one variable is a measurable function of the other.
Compared to other correlation coefficients, the Chatterjee Correlation is less sensitive to outliers and non-linear relationships between variables. However, it requires more computational resources due to its non-parametric nature.
One advantage of the Chatterjee correlation is that it has a simple asymptotic theory under the hypothesis of independence, similar to classical coefficients like Pearson's and Spearman's correlations. This means that we can use it to test the null hypothesis of independence between two variables.
The calculation involves the following steps:
Rank the values of Y in the data set, assigning the smallest value the rank of 1, the second smallest the rank of 2, and so on. Store these ranks in a new column called 'rank'.
Rank the values of X in the data set, assigning the smallest value the rank of 1, the second smallest the rank of 2, and so on. Store these ranks in a new column called 'rank_x'.
Sort the data set in increasing order of 'rank_x'.
Calculate the absolute difference between the ranks of Y for consecutive values of X. Sum up these differences.
Calculate the Chatterjee correlation coefficient using the formula: Chatterjee correlation coefficient = 1 - 3 * (sum of absolute differences in ranks) / (N^2 - 1), where N is the number of observations in the data set.
The resulting Chatterjee correlation coefficient is a number between 0 and 1, with 0 indicating no dependence between the two variables and 1 indicating complete dependence.
Here's a link to the paper.
# function for computing the Chatterjee correlation coefficient
def chatterjee_corr(df, x, y):
    N = df.shape[0]
    df_rank = df[[x, y]].copy()  # copy so the caller's frame is not modified
    df_rank['rank'] = df_rank[y].rank()
    df_rank['rank_x'] = df_rank[x].rank()
    df_rank = df_rank.sort_values(by='rank_x')
    # 1 - 3 * sum(|r_{i+1} - r_i|) / (N^2 - 1)
    chatt_corr = 1 - (3 * df_rank['rank'].diff().abs().sum()) / (pow(N, 2) - 1)
    return chatt_corr
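As a sanity check on the formula: for a strictly monotone, noise-free relationship the sorted ranks are 1..N, every consecutive difference is 1, and the coefficient reduces to 1 - 3/(N + 1). A numpy-only sketch on hypothetical data:

```python
import numpy as np

def chatterjee_xi(x, y):
    # Rank y, order those ranks by x, then apply 1 - 3 * sum|diff| / (n^2 - 1)
    n = len(x)
    order = np.argsort(x, kind="stable")
    y_ranks = np.argsort(np.argsort(y)) + 1   # ranks 1..n (assumes no ties)
    r = y_ranks[order]
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n ** 2 - 1)

x = np.arange(1, 101)
xi = chatterjee_xi(x, x ** 2)   # deterministic monotone relation, n = 100
print(round(xi, 6))             # equals 1 - 3/101
```

Values near this ceiling therefore indicate an almost deterministic dependence, which is the right mental scale for reading the coefficients printed below.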
# Taking few important variables generated from the insights earlier
num_vars_test = ['Age','Number of Dependents','arpu','roam_ic',
'roam_og', 'total_rech_amt', 'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g','arpu_4g',
'aug_vbc_5g','Number of Referrals','Streaming Data Consumption','Satisfaction Score']
# Looping over all pairs of numerical variables to check the Chatterjee correlation
for i, var in enumerate(num_vars_test):
    for j in range(i + 1, len(num_vars_test)):
        corr = chatterjee_corr(df, var, num_vars_test[j])
        if corr > 0.6:
            print(var, 'and', num_vars_test[j], ':', corr)
Age and Number of Dependents : 0.6851975225950853 total_rech_data and arpu_5g : 0.6644614940888286 total_rech_data and arpu_4g : 0.6680885946973494 vol_4g and vol_5g : 0.6343169121632295 arpu_5g and arpu_4g : 0.7502363507931655
Observation:
Age and number of dependents have a moderate positive correlation (0.68). This suggests that as age increases, the number of dependents tends to increase as well.
Total recharge data and arpu_5g have a moderate positive correlation (0.66). This suggests that customers who spend more on data tend to have higher revenue per user for 5G services.
Total recharge data and arpu_4g have a moderate positive correlation (0.67). This suggests that customers who spend more on data tend to have higher revenue per user for 4G services.
Vol_4g and vol_5g have a moderate positive correlation (0.63). This suggests that customers who use more 4G data tend to use more 5G data as well.
Arpu_5g and arpu_4g have a strong positive correlation (0.75). This suggests that customers who spend more on 5G services tend to spend more on 4G services as well.
## **Conclusion**
In this project, we analyzed a telecom dataset with the aim of identifying insights and making recommendations for the business. We performed univariate, bivariate, and multivariate analyses to identify patterns and relationships between various variables.
Through EDA, we identified that the company lacks targeted marketing for specific customers, satisfaction scores are low, and the major reasons for churn were attitude of customer support and competition offering better deals.
Our future scope includes implementing machine learning algorithms for predictive analysis, conducting A/B testing, and continuously monitoring and improving customer satisfaction.
The company should focus on improving its customer support and services to reduce churn rates and increase customer satisfaction.
The company should use the demographic and usage patterns of its customers to create targeted marketing campaigns that are more likely to resonate with specific customer segments.
The company should consider offering incentives to customers who refer friends to use their services, as this can increase customer loyalty and bring in new customers.
The company should explore ways to increase revenue from customers who generate less revenue, such as by offering targeted promotions or service upgrades.
The company should continue to monitor its customer satisfaction score and take action to improve it.